ABSTRACT 


A new Tropical Cyclone (TC) surface wind speed probability product from the 
National Hurricane Center (NHC) takes into account uncertainty in track, maximum wind 
speed, and wind radii. A Monte Carlo (MC) model is used that draws from probability 
distributions based on historic track errors. In this thesis, distributions of forecast track 
errors conditioned on forecast confidence are examined to determine if significant 
differences exist in distribution characteristics. Two predictors are used to define forecast 
confidence: the Goerss Predicted Consensus Error (GPCE) and the Global Forecast 
System (GFS) ensemble spread. The distributions of total-, along-, and cross-track errors 
from NHC official forecasts are defined for low, average, and high forecast confidence. 
Also, distributions of the GFS ensemble mean total-track errors are defined based on 
similar confidence levels. Standard hypothesis testing methods are used to examine 
distribution characteristics. Using the GPCE values, significant differences in nearly all 
track error distributions existed for each level of forecast confidence. The GFS ensemble 
spread did not provide a basis for statistically different distributions. These results 
suggest that the NHC probability model would likely be improved if the MC model 
drew from distributions of track errors based on the GPCE measures of forecast 
confidence. 

I. INTRODUCTION 


A. MOTIVATION AND OBJECTIVE 

In both the civilian and military world, weather phenomena can be classified by 
degrees of impact. An example of a low degree of impact can be a crosswind 
(component of the wind blowing perpendicular to the landing surface) of 20 knots. This 
may ground certain types of sensitive aircraft such as a U-2, but a Boeing 747 or a 
military C-17 can continue to operate. This weather phenomenon can occur often, but its 
impact is minimal. However, a weather phenomenon such as a Tropical Cyclone (TC) 
can have dangerous winds, flood-causing heavy rain, and destructive storm surge over 
large areas, so the degree of impact will be high. An approaching hurricane can curtail 
the civilian economy, military operations, and all routine day-to-day living along several 
hundred miles of coastline and inland areas. The duration of the impact can be several 
days to several months depending on the damage. This weather phenomenon rarely 
occurs but its impacts can be total devastation. 

The 2004 and 2005 Atlantic hurricane seasons demonstrated all too well the 
impacts of hurricanes on both civilian and military centers. For example, in 2005, 
Hurricane Katrina destroyed several hundred miles of coastline including the city of New 
Orleans and Keesler Air Force Base (AFB). Destruction in New Orleans cost billions of 
dollars and caused more than 1,000 deaths. It will take many years to restore the city to 
its pre-Katrina status. Keesler AFB experienced a large storm surge that caused heavy 
damage to almost every structure. As a training hub for the Air Force, Keesler AFB 
trains thousands of Airmen every year, especially new personnel receiving their initial 
career training. With the destruction caused by Hurricane Katrina, the mission was 
forced to stop for a few months, which had a major impact on the Air Force in general. 

The best information available needs to be supplied to the civilian and military 
worlds to protect lives and allow preparations that will minimize the damage. The goal 
of this thesis is to improve this information. 

A 2003 Joint Hurricane Testbed project funded Dr. Mark DeMaria, Dr. John 
Knaff, and colleagues to transfer to operations a new probabilistic product. This product 
uses a new statistical model to determine the probabilities that certain wind speed 
thresholds will be exceeded at certain points. This project represented a radical departure 
from the old way of probabilistic TC forecasting. 

The goal of this thesis is to investigate whether improvements to this new 
probabilistic model could be made by introducing different distributions of track forecast 
errors that the model utilizes to calculate probabilities. If it is possible to use different 
distributions for different situations, the probabilistic output may be more representative. 
Such improved guidance could lead to a reduction in the massive costs of overly cautious 
evacuations when track forecast confidence is high, or even save lives by expanding the 
necessary evacuation zone when forecast confidence is low. For this thesis, the following 
hypothesis will be investigated: 

Track forecast error distributions may be altered by considering forecast 
uncertainty (i.e., difficulty), which is defined by the variations among the 
track forecast aids, and introduction of these track forecast error 
distributions by forecast interval will positively influence the strike 
probability distributions along the track. 

B. 2005 ATLANTIC HURRICANE SEASON 

Perhaps no other Atlantic hurricane season in history stressed the importance of 
timely and accurate forecasts more than the 2005 season. With 28 tropical storms and an 
additional three depressions, the 2005 Atlantic hurricane season was the most active on 
record (see Figure 1). The season broke the previous record of 21 named storms set in 
1933. In addition, 15 of the 28 tropical storms intensified to hurricane strength, which 
broke the record of 12 set in 1969. To further illustrate the severity of the 2005 Atlantic 
hurricane season, seven of the 15 hurricanes became major hurricanes (Category 3 or 
higher), four of which made landfall in the United States (Dennis, Katrina, Rita, and 
Wilma). Also included among the seven major hurricanes were four Category 5 storms (Emily, 
Katrina, Rita, and Wilma), which was one more than the previous record of three 
Category 5 storms (NHC 2006). 

The damages during the 2005 Atlantic hurricane season were estimated to be 
$150 billion. This total broke the prior damage records set in 1992 and 2004, which each 
had $50 billion in damage when adjusted to 2005 dollars. It was also the deadliest 
hurricane season for the U.S. since 1928 (Wilson 2006). 



Figure 1. 2005 Atlantic hurricane season track map (from NHC, 
http://www.nhc.noaa.gov/tracks/2005atl.gif). 


Some other notable records broken during the 2005 Atlantic hurricane season 
include the strongest hurricane (Wilma 882 mb), three of the six strongest on record 
(Wilma 882 mb, Rita 897 mb, and Katrina 902 mb), top sustained winds (Emily at 
160 mph), and the longest-lasting hurricane for the month of December (Epsilon) 
(Wilson 2006). 

The 2005 hurricane season reinforced the importance of TC forecasting, 
preparation, and evacuation in advance of landfall. The devastation caused by Hurricane 
Katrina captured the public’s attention in a way that no hurricane has been able to do 
since the storm that unexpectedly hit Galveston, Texas in 1900 and killed over 6000 
people. Unfortunately, history has shown that sometimes it takes a catastrophe to initiate 
dramatic progress for the better. The hope is that the lessons learned from Katrina will 
save more lives in the future than were lost in this disaster. 

C. CIVILIAN IMPACTS AND PROCEDURES 

In advance of a hurricane, the public has little control of the eventual outcome. 
When required, the best they can do is to shore up property, collect personal belongings, 
evacuate the area, and then hope for the best. 

Emergency managers use an estimate of economic impact of about $1 million per 
mile of coastline for evacuation. This number is highly subjective, and studies have 
shown it to depend on several factors, including population, economic 
class, and storm intensity. Whitehead (2000) found it cost between $1 million and $50 
million to evacuate the North Carolina (NC) coastline depending on storm intensity and 
cost of evacuation plans. Since there are more than 50 miles of NC coastline (Figure 2), 
even the worst-case scenarios have the cost of evacuation less than $1 million per mile. 
Either way, evacuation is a costly proposition in preparation, transportation, and lodging, 
and the shutting down of the local economy for an unknown number of days. The 
cost/risk analysis is different for every evacuation zone, which makes the evacuation 
declaration a hard decision to make. 

Given the human lives and economic factors involved, the NHC takes its 
watch/warning advisories very seriously. A missed forecast may lead to a whole 
population center being at risk, and a false alarm can cost a lot of money and decrease 
public confidence in its forecasts. Hurricane Katrina was a perfect example of how 
public confidence can change over the years. For years, the doomsday scenario had been 
possible for New Orleans each time a hurricane approached that section of coast. In 
nearly every case, the storm would either miss or not be strong enough to cause 
significant damage, which may have caused some of the public to ignore the NHC 
advisories in advance of Katrina. The hope was that it would veer and miss just as had 
occurred plenty of times in the past. Three days before landfall, a lot of people were still 
lackadaisical about the threat, even though the NHC had high confidence in its forecast 
that New Orleans was likely to experience a direct hit. The end result was that many 
people were caught off guard when the impact became imminent. By the time landfall 
near New Orleans was imminent, there was not enough time to evacuate for those who 
stayed behind. 

Figure 2. North Carolina Hurricane Evacuation Routes map, an example of the 
complex planning involved for evacuating coastlines (from NOAA’s 
Hurricane Evacuation Zone Maps archive, 
http://www.dem.dcc.state.nc.us/hurricane/HurricaneEvacuationRoutes.pdf). 


The NHC is continually looking for ways to decrease the distance along the coast 
for their watch/warnings as their guidance products become more accurate. As 
previously mentioned, too many false alarms will lower the public’s confidence in the 
forecasts, especially because of the high costs and inconvenience involved with 
evacuation. The goal of this thesis is to contribute to the solution of these problems by 
improving hurricane probabilistic predictions. A decrease in false alarms directly saves 
the public money and indirectly saves lives by increasing the public’s confidence in the 
forecasts. 

D. MILITARY IMPACTS AND PROCEDURES 
1. TC Impacts on the Military 

Every military installation along the Eastern Seaboard, Gulf of Mexico, and in the 
Atlantic Basin has some type of hurricane preparedness plan. Each installation is unique 
in its plans, weather thresholds, and reaction to approaching TCs. 

Air Force Bases, Army Air Fields, and Navy and Marine Airfields all have 
evacuation plans to fly aircraft out of the storm’s destructive path. In addition, military 
ships need to steer clear of the storm or sortie from a port, military installations need to 
be prepared, and personnel need time to prepare their property and families to evacuate. 
All of this preparation can cost money and a false alarm can cripple a budget, but a 
missed forecast can be much more expensive and devastating. So commanders want the 
best information available to make their decision. 

2. TC Impacts on the 45th Weather Squadron (WS), Patrick AFB, 
Florida 

The 45 WS has perhaps the most sensitive mission in all of the U.S. military when 
it comes to hurricane preparedness. The NASA Space Shuttle and other space launch 
vehicles that are launched from Cape Canaveral Air Force Station (AFS) and Kennedy 
Space Center are unique in terms of importance, cost, ability to replace the resources, and 
the ability to reproduce the mission elsewhere (Figure 3). The 45 WS provides all of the 
operational weather support to Cape Canaveral AFS and Kennedy Space Center. As 
home of the Space Shuttle along with a host of other rockets that often carry very 
expensive payloads, the 45 WS is required to provide highly specialized TC forecasts 
(Winters et al. 2006). 

Because of the time needed to prepare the Space Shuttle for a possible hurricane 
strike, preparation starts days in advance. In the daily briefing with the Kennedy Space 
Center, the 45 WS reports any tropical activity in the Atlantic Basin. If it looks as though 
a possibility exists that a TC will impact operations, then several escalating steps are 
made as a strike becomes more imminent. 






Figure 3. Space Shuttle rollback from the launch pad to the Vehicle Assembly 
Building (Provided by William Roeder of the 45 WS). 

Interestingly, the 45 WS has communicated the threat of potentially hazardous TC 
weather to their customers in a probabilistic manner for years. Although some customers 
want a definite yes or no answer, the uncertainty that is inherent in forecasting the 
atmosphere makes this an impossible proposition. The NASA decision makers actually 
want a probabilistic forecast. Specifying probabilities allows the forecaster to 
communicate the uncertainty in the forecast to the customer so that they can use 
probabilistic decision making to minimize the costs, risks, and expected impacts. 

The probabilities that the 45 WS provides have to be the best and most accurate 
available. A missed forecast can cause billions of dollars in damage to the Space Shuttle, 
other space launch vehicles, and payloads. On the other hand, a false alarm can be an 
expensive mistake too in costs to relocate the space launch vehicles to protective shelters, 
risks of damage during transport, and lost launch opportunities in the range schedule. 

The Space Shuttle and the launch pad can safely ride out wind gusts of 70 kt. If 
wind gusts greater than 70 kt are anticipated, then the Space Shuttle can roll back to the 
Vehicle Assembly Building where it can safely sustain steady winds up to 113 kt. 
However, the rollback phase has its own weather restrictions, including less than 10% 
chance of lightning within 20 n mi, steady winds less than 40 kt, and peak winds less than 
60 kt. These weather conditions must also be forecast to allow a safe and successful 
rollback. While the rollback can be done in only 8 h, preparations usually begin 39 h 
before the rollback, and the decision is usually made 48 h before the start of the rollback. 
However, the formal decision process usually starts at least 72 h in advance of the 
rollback, with pre-planning beginning as early as 120 h in advance (Winters et al. 2006). 

In addition to the space vehicle rollback decisions, the 45 WS also gives advice as 
to the Hurricane Condition (HURCON), preparation, aircraft relocation, and personnel 
decisions by the 45th Space Wing at Cape Canaveral AFS, Patrick AFB, and NASA at 
Kennedy Space Center. The HURCONs that are used by the military are based on the 
expected onset of 50-kt sustained winds from a tropical cyclone. HURCON-IV, 
HURCON-III, HURCON-II, and HURCON-I mean that 50-kt winds are expected within 72 
hours, 48 hours, 24 hours, and 12 hours, respectively. 
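
The HURCON lead times above can be written as a simple lookup. The following Python sketch is illustrative only: the function name hurcon_level is hypothetical, and treating each lead time as an inclusive upper bound (for example, an onset expected in exactly 24 h maps to HURCON-II) is an assumption, since the text does not state how the boundaries are handled.

```python
from typing import Optional

# Lead times (h) taken from the text, ordered from most to least urgent.
HURCON_LEAD_TIMES_H = [
    (12, "HURCON-I"),
    (24, "HURCON-II"),
    (48, "HURCON-III"),
    (72, "HURCON-IV"),
]


def hurcon_level(hours_to_50kt_onset: float) -> Optional[str]:
    """Return the most urgent HURCON whose lead time covers the expected onset.

    Assumption: boundaries are inclusive. Returns None when the expected
    onset of 50-kt sustained winds is more than 72 h away.
    """
    if hours_to_50kt_onset < 0:
        raise ValueError("onset time must be non-negative")
    for lead_time_h, name in HURCON_LEAD_TIMES_H:
        if hours_to_50kt_onset <= lead_time_h:
            return name
    return None
```

For example, an expected onset in 60 h would map to HURCON-IV under these assumptions, and an onset 100 h away would map to no HURCON at all.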

When the new NHC probability products became available in 2005 to the public 
as an experimental product, the 45 WS began evaluating the product. This evaluation 
found the product to be extremely useful in that it provides a more objective method for 
producing probabilities that the winds will exceed their customers’ thresholds. These 
objective products were deemed superior to the previous primarily subjective method 
used by the 45 WS. 

Two main improvements were suggested by the 45 WS as a result of their 
evaluation: 1) performance verification of the probabilities for the various wind 
thresholds for all the forecast intervals; and 2) investigating if the product could be 
improved by scaling the errors for each forecast interval for each TC by the forecast 
confidence as parameterized by the ensemble spread among the forecast models. If the 
ensemble spreads of TC location and intensity for today’s forecast were only half of the 
historical spreads, then perhaps the statistical error model to be applied to today’s 
forecast would be better built using only half of the historical average. The performance 
verification requested by the 45 WS would help them advise their customers on the proper 
decisions. For example, the 45 WS noticed that the probabilities at long forecast intervals 
were surprisingly low for locations that eventually experienced those winds, and such 
low probabilities could mislead their customers into delaying their preparations. A 3% 
probability of at least 50 kt steady wind at a location for a 120-h forecast may seem like a 
low risk, but it actually corresponds to a high risk. Thus, the 45 WS needs to know the 
verification rates of the NHC probability products as a function of forecast probability 
and forecast interval. 

This thesis is primarily driven by the need of the 45 WS to continuously look for 
ways to improve their TC forecasts. A successful effort could possibly lead to TC 
forecast improvements for both the military and civilian worlds. 

Background material regarding important concepts related to this thesis is 
provided in Chapter II. The methodology used for this study is described in Chapter III. 
The results of the study are presented in Chapter IV, and the conclusions and future 
recommendations are given in Chapter V. 

II. BACKGROUND 


A. NATIONAL WEATHER SERVICE PROBABILISTIC TROPICAL 
CYCLONE FORECASTS 

1. Prior NHC Methods to Convey Uncertainty in the Forecast 

Starting in the early 1980s, the National Weather Service started issuing 
Watch/Warning graphical maps along with the TC advisories (Figure 4). This graphic 
display was established in part to help convey the uncertainty in TC forecasting to the 
public. The “cone,” or white area, on both sides of the track is the area within which, 
based on the average track forecast errors of the last 10 years, the TC will verify 90% of the time. 
At each 12-h forecast interval, a circle is drawn with a radius of the average 10-year track 
forecast error, and then the circles are connected to form a cone. In other words, this 
model assumes that the same average track forecast error at each forecast interval applies 
to all tropical cyclones. While this conveys uncertainty in the forecast track of the storm, 
it does not include uncertainty in the forecasts of intensity and radii of wind speed 
thresholds. In addition, the public tends to fixate on the forecast track of the TC, or the 
center “black line,” while tending to ignore the cone on each side. Some people on the 
periphery of the cone have tended to delay evacuation until an actual shift in storm 
motion takes place, perhaps because they did not understand the actual probabilities of a 
strike. Thus, people on the periphery of the cone have been caught off guard several 
times in the past when the TC deviates from the forecast track. A great example of this 
situation is Hurricane Charley in 2004, which will be discussed in the next section. 
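
The cone construction described above (one circle per forecast interval, each with the average historical error as its radius) can be sketched in a few lines of Python. This is a simplified illustration rather than the NHC procedure: positions are treated as points on a flat plane in n mi, the circles are not connected into a smooth envelope, and the function names and any radii passed in are hypothetical.

```python
import math


def cone_circles(track_xy, mean_error_nmi):
    """Pair each forecast position with the average error radius for its interval.

    track_xy: list of (x, y) forecast positions in n mi (flat-plane assumption).
    mean_error_nmi: one average track error per position, the same for every storm.
    """
    if len(track_xy) != len(mean_error_nmi):
        raise ValueError("one radius is needed per forecast position")
    return [(x, y, r) for (x, y), r in zip(track_xy, mean_error_nmi)]


def inside_any_circle(point_xy, circles):
    """Return True if the point lies within at least one of the error circles."""
    px, py = point_xy
    return any(math.hypot(px - x, py - y) <= r for x, y, r in circles)
```

Because the radii depend only on the forecast interval, every storm receives the same circle sizes in this sketch, which is exactly the limitation that the text attributes to the cone graphic.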

The NHC also issues a strike probability forecast in text format (Figure 5). The 
probabilities are determined by the percentage of times a TC within the given time frame 
will pass within 75 nautical miles (n mi) to the right or 50 n mi to the left of a point 
relative to the direction of cyclone motion. These probabilities are also conditioned on 
historical tracks and do not consider different intensities and critical radii differences 
among storms. While this product may provide a good indicator of probability and 
uncertainty, the strike probability forecast only goes out to 72 h and is not a user-friendly 
product. 

Because the strike probability forecast text products are not user-friendly and the 
Watch/Warning graphical product does not convey uncertainty in intensity and radii, a 
totally new method of probabilistic TC forecasting was created. 



Figure 4. Watch/Warning 3-Day map for Katrina (from NHC, 
http://www.nhc.noaa.gov/archive/2005/KATRINA_graphics.shtml). 

Figure 5. Strike-Probability forecast for Katrina (from NHC, 
http://www.nhc.noaa.gov/archive/2005/prb/all22005.prblty.021.shtml). 

2. New NHC Methods to Convey Uncertainty in the Forecast 

In recent years, probabilistic forecasting has gained increasing acceptance in the 
meteorological community as a supplement to deterministic forecasting. The correct use 
of probabilistic models along with deterministic models provides a forecaster with an 
opportunity to convey uncertainty in the forecast. 

In the early 2000s, it was apparent that the products NHC issued to the public 
were becoming obsolete. Other than routinely updating the 10-year average errors, an 
update to the Watch/Warning “cone” forecast had not occurred since its development in 
the early 1980s and the strike probability product still used out-of-date statistical analysis. 
Newer and better statistical methods were available, which along with faster computers 
made the new methods cost-effective. Gross et al. (2004) proposed a new way for 
determining TC wind speed probabilities. This new probabilistic product would not only 
convey uncertainty in the track forecast, but also convey uncertainty in storm intensity 
and wind speed radii forecasts. 

A “Monte Carlo” (MC) sampling technique was employed to meet these new 
requirements. The MC method consists of statistically generating a sample of random 
numbers from a reference distribution and observing the properties of that sample. In this 
case, a large sample of plausible tracks relative to a given forecast track is generated by 
randomly selecting from track forecast error distributions derived from a historical 
database of NHC Official Forecast (OFCL) track forecasts. A similar approach is used 
for deriving the intensity and wind radii distributions. By counting the number of 
realizations in which winds of a given threshold (34, 50, 64, or 100 kt) reach a specific grid point 
and dividing by the total number of MC realizations, probabilities are determined (Gross 
et al. 2004). 
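
The counting step described above can be illustrated with a minimal Python sketch. It is not the operational NHC code: it perturbs only the forecast center position at a single forecast time, holds the wind radius fixed rather than sampling intensity and radii, and works on a flat plane with distances in n mi. The function name wind_probability, its parameters, and any error values supplied to it are hypothetical and introduced only for illustration.

```python
import math
import random


def wind_probability(grid_xy, forecast_xy, historical_errors_nmi,
                     wind_radius_nmi, n_realizations=1000, seed=0):
    """Estimate the probability that threshold winds reach a grid point.

    historical_errors_nmi: list of (dx, dy) past track errors. They are
    sampled directly, so no Gaussian or other assumed form is required.
    """
    if not historical_errors_nmi:
        raise ValueError("at least one historical error is required")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    gx, gy = grid_xy
    fx, fy = forecast_xy
    hits = 0
    for _ in range(n_realizations):
        # One realization: shift the forecast center by a randomly drawn error.
        ex, ey = rng.choice(historical_errors_nmi)
        if math.hypot(gx - (fx + ex), gy - (fy + ey)) <= wind_radius_nmi:
            hits += 1
    # Probability = realizations that reach the point / total realizations.
    return hits / n_realizations
```

Sampling with rng.choice from the list of past errors is what lets the sketch reproduce whatever shape the historical error distribution has, which is the property the MC approach is chosen for.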

The advantage of using the MC method for the new model is that the track, 
intensity, and wind radii error distributions being sampled often are not normally distributed 
and do not fit any assumed statistical form. Since these real error distributions are sampled 
directly, a Gaussian distribution is not required. 

The approach in this thesis is to adjust measures of the track forecast error 
distributions used by the MC model conditioned on measures of the track forecast 
confidence. The hypothesis is: If the track forecast confidence is high (low), then it is 
more appropriate for the model to draw from historic track forecast errors that were 
produced when past forecast confidence was high (low). To examine this hypothesis, it 
must be determined that the track forecast error distributions are significantly different 
for different levels of forecast confidence. 
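
One way to picture the conditioning step is to sort past forecasts by a confidence predictor and split their track errors into three pools. The Python sketch below assumes equal-sized terciles and assumes that a small predictor value (small predicted error or spread) means high confidence; neither choice is stated in this section, and the function name tercile_pools is hypothetical.

```python
def tercile_pools(cases):
    """Split historical track errors into high, average, and low confidence pools.

    cases: list of (predictor_value, track_error_nmi) pairs from past forecasts.
    Assumption: the lowest third of predictor values is treated as high
    confidence and the highest third as low confidence.
    """
    if len(cases) < 3:
        raise ValueError("at least three cases are needed to form terciles")
    ordered = sorted(cases, key=lambda case: case[0])
    n = len(ordered)
    cut1, cut2 = n // 3, 2 * n // 3
    return {
        "high": [error for _, error in ordered[:cut1]],
        "average": [error for _, error in ordered[cut1:cut2]],
        "low": [error for _, error in ordered[cut2:]],
    }
```

A confidence-conditioned MC model would then draw its track errors from the pool that matches today's predictor value, provided the pools are first shown to have significantly different distributions.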

3. New Tropical Cyclone Surface Wind Speed Probability Products 

As mentioned previously, the new probability products will convey uncertainty in 
the track forecast, intensity, and wind speed radii. Gross et al. (2004) developed a 
graphical method and format to more adequately display this information to the public. 

Since this new probability program is radically different from the previous 
operational probability products produced by the NHC, a committee within the NWS is 
providing oversight for the development of new operational products from the MC model 
output. These products were in experimental stage during the 2005 hurricane season, 
with the plan of providing them to the public starting in 2006 (Knaff and DeMaria 2005). 

The new graphical products using the new MC model give the cumulative 
probability for each point in the Atlantic basin and surrounding continents that a certain 
wind speed threshold will be exceeded within the given time period. For example 
(Figure 6), the current experimental products display cumulative probabilities for time 
periods from 12 h to 120 h. The probabilities are determined for wind speed thresholds 
consisting of 34-kt (tropical storm strength), 50-kt, and 64-kt (hurricane strength) winds. 

Some advantages of the new probability products from the NHC may be 
demonstrated with the case of Hurricane Charley in 2004. Although Hurricane Charley 
was forecast to hit Tampa Bay, Florida, the storm veered and the landfall point was at 
Port Charlotte, Florida, which caught some people off guard. Although the 
Watch/Warning advisory (Figure 7) had Port Charlotte within the “cone,” the public 
tends to focus on the forecast track, or the “black line,” while those on the periphery of 
the cone tend to wait for any changes in the forecast before taking immediate actions. 
Since Hurricane Charley was traveling at a small angle relative to the west coast of 
Florida, a small change in trajectory caused a large change in landfall location. The 
problem with Hurricane Charley is that when the change in track became apparent there 
wasn’t enough time to adequately protect properties/boats and evacuate. The result was 
many more dollars in damage than would have occurred if the public had received adequate notice. 

Figure 6. Experimental wind probability products for Hurricane Ivan (2004) for a) 24 
h, 39 mph, b) 120 h, 39 mph, c) 24 h, 74 mph, and d) 120 h, 74 mph (from 
NHC, http://www.nhc.noaa.gov/feedback-pws-graphics2.shtml). 


The new probability model (Figure 8) was in an experimental stage during 
Hurricane Charley. It is clear that 24 h before landfall both Tampa Bay and Port 
Charlotte have the same probability of hurricane force winds. If the public had the new 
graphic instead of the Watch/Warning “cone” graphic, perhaps more people at Port 
Charlotte would have been better prepared for a strike. 

Examples such as this are the main motivation behind the proposed modification 
of the NHC probability products to include forecast confidence-conditioned error 
distributions addressed in this thesis. When the public receives better probabilistic 
information, more lives and property may be saved. 

Figure 7. Five-day Hurricane Charley Watch/Warning valid at 15 UTC 12 Aug 2004 
(from Knaff and DeMaria 2005). 


Figure 8 panel annotations: 64-kt cumulative wind probability, 0-120 h, 18 UTC Thu 12 Aug, 
Hurricane Charley (2004). Note that chances of hurricane conditions at Tampa Bay and 
Port Charlotte are both around 30%. 


Figure 8. New graphic of cumulative probability at 64-kt winds for Charley (from 
NHC, http://www.nhc.noaa.gov/feedback-pws-graphics2.shtml). 

B. CONSENSUS FORECASTING 

1. Evolution of Consensus Forecasting in Meteorology 

Consensus forecasting consists of averaging multiple predictions that were 
generated either from slightly different starting conditions of the same model or from several 
different models. The final objectives are to minimize bias in specific models, 
minimize non-predictive components, and to improve the final forecast by averaging the 
different forecasts. Consensus forecasting has been used in fields that have to deal with 
dynamic modeling, ranging from finance to biology. 

Due to the lack of observations, limited computing power, different observing 
methods/instruments, differences in calibration among the same types of instruments, 
observer bias, and observer error, it is impossible to specify the true atmospheric state at 
any one time for input into a numerical forecast model. In addition, different models may 
have different biases that may consistently provide erroneous category values. The goal 
of consensus forecasting is to minimize the aforementioned errors by averaging many 
different forecasts. 

Consensus forecasting in meteorology started in the late 1970s. Thompson (1977) 
stated the advantage in consensus forecasts: 

The purpose of this note is to draw attention to a fact that does not appear 
to be widely recognized or accepted, but which was probably known to 
Gauss in 1802. This is the incontrovertible fact that two or more 
inaccurate but independent predictions of the same future events may be 
combined in a very specific way to yield predictions that are, on the 
average, more accurate than either or any of them taken individually. 

Thompson basically states that, instead of relying on just one model to make a forecast, 
forecasts from several different models or from several different initial conditions with the same 
model may be combined to produce a more accurate forecast of future conditions. 
Two approaches to consensus forecasting have been used to provide the future evolution 
of the atmosphere. A single numerical model may be integrated from many different 
initial conditions to provide a set of forecasts that describe the future state of the 
atmosphere. Alternatively, the forecasts from different numerical models may be averaged. 
The consensus TC track forecasts use the latter method. Leslie and Fraedrich (1990) 
applied this method to TC track prediction by using linear combinations of forecasts from 
various prediction models, and showed the consensus forecasts had a significant 
improvement. Goerss (2000) showed that using a consensus of operational track 
prediction models greatly reduced the average track forecast error over a season. 
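
Thompson's point can be checked numerically under idealized conditions. The Python sketch below assumes two unbiased forecasts whose errors are independent and Gaussian with the same standard deviation, and combines them with equal weights, which is only one special case of the "very specific way" Thompson describes. Real model errors are neither fully independent nor Gaussian, so the size of the improvement shown here is an upper-end illustration. All names and values are hypothetical.

```python
import math
import random


def rmse(errors):
    """Root-mean-square of a list of forecast errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def simulate_consensus(n_cases=20000, sigma_a=100.0, sigma_b=100.0, seed=1):
    """Compare two independent forecasts with their equal-weight average.

    Returns (rmse_a, rmse_b, rmse_consensus) in the same units as sigma.
    """
    rng = random.Random(seed)
    err_a = [rng.gauss(0.0, sigma_a) for _ in range(n_cases)]
    err_b = [rng.gauss(0.0, sigma_b) for _ in range(n_cases)]
    # The error of the averaged forecast is the average of the two errors.
    err_consensus = [(a + b) / 2.0 for a, b in zip(err_a, err_b)]
    return rmse(err_a), rmse(err_b), rmse(err_consensus)
```

With equal standard deviations, the consensus error in this idealized setting approaches the individual error divided by the square root of two, since the variance of the mean of two independent errors is half the individual variance.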

In this thesis, a product called CONU that uses this consensus approach to provide 
an estimate of forecast difficulty or confidence will be used to specify different forecast 
error distributions that should be used in the MC model. 

2. CONU Product 

The Consensus (CONU) product is an average of the tracks from any combination of five 
dynamical models (GFDI, AVNI, NGPI, UKMI, and GFNI), as long as at least two 
tracks are available. That is, the CONU forecast is the mean of the track forecasts from 
whichever of the five individual models are available. The fact that the CONU only requires 
two of the five models to be available to produce a track forecast makes it available to 
forecasters more often than other consensus track forecasts such as GUNA, which requires 
all four of its model tracks to be available (see below). 
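
The availability rule for CONU can be sketched as follows in Python. The sketch assumes a plain arithmetic mean of latitude and longitude at a single forecast time, which is a reasonable approximation away from the poles and the date line but is not confirmed here as the operational formula. The function name conu_position and the input layout are hypothetical.

```python
CONU_MEMBERS = ("GFDI", "AVNI", "NGPI", "UKMI", "GFNI")


def conu_position(member_positions):
    """Average the available member positions for one forecast time.

    member_positions: dict mapping model name to a (lat, lon) tuple, or to
    None (or simply omitted) when that model's track is unavailable.
    Returns None when fewer than two member tracks are available.
    """
    available = [
        member_positions[name]
        for name in CONU_MEMBERS
        if member_positions.get(name) is not None
    ]
    if len(available) < 2:
        return None  # CONU requires at least two of the five members
    mean_lat = sum(lat for lat, _ in available) / len(available)
    mean_lon = sum(lon for _, lon in available) / len(available)
    return (mean_lat, mean_lon)
```

A GUNA-style consensus would differ only in the availability check, returning a position only when every one of its four members is present.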

As mentioned previously, consensus track forecasts almost always out-perform 
the individual members over time (Figure 9). The CONU product out-performs all of its 
individual members for every forecast interval except at 120 h, for which the UKMI did 
better by a few n mi. In addition, the CONU errors are almost exactly the same as for the 
official forecasts from the NHC, which indicates that the NHC forecasters rely on the 
CONU guidance and are able to add value to the CONU product at 96 h and 120 h. 

An important factor for forecasters is the product availability, since a product will not be 
used if it is only available sporadically or for a limited number of cases. Goerss (2006) 
compared the availability of the CONU product with that of other ensembles used in TC 
track forecasting. The GUNS ensemble is a simple consensus of the GFDL, UKMET, 
and NOGAPS tracks, and GUNA is computed when track forecasts from the GFDI, 
UKMI, NGPI, and AVNI are all available. The CONU product is available more often 
than the GUNS and GUNA products with 90% or above availability for all forecast 
intervals (Figure 10). 

Figure 9. Non-homogeneous TC track forecast errors (n mi) on the ordinate for each 
forecast interval (h) displayed on the abscissa during the 2005 Atlantic 
season (From Goerss 2006). 



Figure 10. Availability of the CONU, GUNS, and GUNA consensus products during the 
2005 Atlantic season. The forecast interval (h) is on the abscissa and the 
percent availability is along the ordinate (From Goerss 2006). 

Since the CONU product is provided when only two of its five ensemble 
members are available and GUNA is provided only when all four dynamical model tracks 
are available, one would expect GUNA to have smaller track errors over time than 
CONU. However, Goerss (2005) found that CONU slightly outperformed GUNA out to 
72 h and performed as well as GUNA out to 120 h during the 2001-2003 Atlantic 
hurricane seasons (Figure 11). Similar error statistics were observed for the 2005 season 
(Figure 12). Given the advantage of greater availability of the CONU over the GUNA 
(Figure 10), and the similarity in performance, CONU is the better choice for the 
forecaster. 

As the CONU forecasting concept gained increasing acceptance in the TC 
community, it became clear that another potential benefit that might be gained from the 
CONU sample of model tracks would be to predict the error of the CONU track forecast. 
That is, could the spread of the model tracks be used as a measure of the confidence that 
the forecaster should put in today’s CONU track forecast? 



Figure 11. Homogeneous comparison of TC track forecast errors for the official, 
GUNA, and CONU (see insert) during the 2001-2003 Atlantic hurricane 
seasons. The forecast interval (h) is displayed on the abscissa and the error 
(n mi) is on the ordinate (From Goerss 2005). 

Figure 12. Track forecast errors (n mi, along ordinate) for the CONU product during 
the entire 2005 Atlantic season and for selected storms. The forecast 
interval (h) is given along the abscissa (From Goerss 2006). 


3. GPCE Value 

Goerss (2005) developed a method for predicting the error in the CONU TC track 
forecast that is called the Goerss Predicted Consensus Error (GPCE). In addition to the 
spread of the CONU members, other predictors are included that are available before the 
official forecast is issued by the NHC. These predictors include: the consensus model 
spread, initial and forecast TC intensity, initial TC position and forecast displacement of 
TC position, TC speed of motion, and the number of members available for the CONU 
ensemble. The predictors were compiled from the 2001 through the 2004 Atlantic 
hurricane seasons. 

A stepwise linear regression model was derived to predict the errors in the CONU 
forecast, which are displayed as circular areas drawn around the CONU forecast 
positions. Each circle represents the area within which the TC position would verify 
approximately 75% of the time. Since all of these predictors are available in real-time, 
the predicted error (GPCE) of the CONU forecast would be available to a forecaster 
before the official forecast was determined. Summed over the 2005 Atlantic hurricane 
season, the verifying positions were within the GPCE-derived circle approximately 75% 
of the time (Figure 13). In individual storms, the verifying positions may lie outside the 
circle. For example, the GPCE performance for Katrina, Rita, and Wilma in Figure 13 
demonstrates the variability in performance for individual storms. However, when the 
three storms are averaged, the GPCE circles still contain the verifying position about 75% of the time. 
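
The hit rate quoted above is simply the fraction of verified forecasts whose CONU position error does not exceed the GPCE radius. A minimal sketch follows; the function name and the pairing of each error with its radius are assumptions made for illustration.

```python
def gpce_hit_rate(cases):
    """Fraction of cases in which the verifying position lies in the circle.

    cases is a list of (conu_error_nmi, gpce_radius_nmi) pairs, where the
    error is the distance from the CONU forecast position to the verifying
    best-track position (hypothetical layout for this sketch).
    """
    cases = list(cases)
    if not cases:
        raise ValueError("no verified forecasts to evaluate")
    hits = sum(1 for error, radius in cases if error <= radius)
    return hits / len(cases)
```

Applied to all forecasts of one forecast interval, a value near 0.75 would indicate that the GPCE circles are behaving as designed.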

Some examples from Hurricane Katrina (Figure 14) demonstrate how the GPCE 
model works. The individual tracks represent the spread among the various members of 
the CONU, including the ensemble mean position which is at the center of the circles 
shown in Figure 14. The radius of the circle is the calculated GPCE value. The red dot 
represents the verifying position of Katrina. 



Figure 13. For each forecast interval (h, abscissa), the percent of cases (ordinate) in 
which the verifying position is contained within the GPCE-defined circle, 
shown for the season, for Katrina, Rita, and Wilma, and for the three 
storms combined (From Goerss 2006). 


For a lead time of five days (Figure 14 a), the verifying position is just inside the 
GPCE circle that is centered on the CONU 120-h forecast position near Panama City, 
Florida. The high variability among the 120-h individual model positions that make up 
CONU leads to the large circle radius. For a four-day lead (Figure 14 b), the ensemble 
members are in better agreement so the GPCE radius is smaller. Given the CONU 96-h 
position and the smaller circle, the verifying position falls outside the predicted forecast 
error circle, which is to be expected in about one out of every four forecasts. The radius 
of the GPCE circle becomes progressively smaller at lead times of two days (Figure 14 d) 
and one day (Figure 14 e) as the ensemble member tracks are more in agreement. The 
verifying positions fall well within the GPCE radii for both of these forecasts. 

Results from the 2005 Atlantic hurricane season (Goerss 2006) verify that the 
GPCE model is a reliable tool to determine forecast confidence in the CONU forecast. 
For the season, the verifying positions of the TCs fell within the GPCE circles 76%, 77%, 
77%, 75%, and 75% for the 24-, 48-, 72-, 96-, and 120-h forecasts, respectively (Goerss 
2006). 



Figure 14. Predicted consensus error for CONU forecasts of Hurricane Katrina at (a) 
120 h from 12 UTC 24 Aug, (b) 96 h from 12 UTC 25 Aug, (c) 72 h from 
12 UTC 26 Aug, (d) 48 h from 12 UTC 27 Aug, (e) 24 h from 28 Aug 2005 
(From Goerss 2006). 

C. RELATIONSHIP BETWEEN MODEL SPREAD AND FORECAST SKILL 

1. Measuring Forecast Position Error 

The error of the forecast is determined from the verifying best-track position, 
which is determined in the post-analysis stage after examining all the available data. The 
error in the CONU model is dependent upon two things: 1) the mean forecast error of the 
individual models that make up the consensus; and 2) the degree of independence of the 
forecast errors of the individual models (Goerss 2000). 

In addition to the magnitude of the error of the forecast track, the error may also 
be defined in terms of the cross-track forecast error C and the along-track forecast error A, 

$$E = (C^2 + A^2)^{1/2} \qquad (1)$$

in which E is position error (Neumann and Pelissier 1981). The along-track forecast 
error represents whether the forecast was fast (positive value) or slow (negative value). 
The cross-track forecast error represents how far left (negative value) or right (positive 
value) the forecast track is relative to the verifying position (see Figure 15). 
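
The sign conventions above can be made concrete with a small sketch. It is an illustration under stated assumptions rather than the method used for the thesis statistics: the storm heading is taken from the previous best-track position to the verifying position, and a local flat-earth approximation (60 n mi per degree of latitude) replaces great-circle geometry. The function names are hypothetical.

```python
from math import cos, hypot, radians


def to_local_nmi(lat, lon, ref_lat, ref_lon):
    """East and north offsets (n mi) of a point from a reference point."""
    east = (lon - ref_lon) * 60.0 * cos(radians(ref_lat))
    north = (lat - ref_lat) * 60.0
    return east, north


def track_error_components(prev_best, verif_best, forecast):
    """Return (along, cross, total) track errors in n mi.

    Each argument is a (lat, lon) pair in degrees. Along-track error is
    positive when the forecast is ahead of (faster than) the best track;
    cross-track error is positive when the forecast is to the right.
    """
    # Direction of storm motion (assumed: previous to verifying position).
    mx, my = to_local_nmi(verif_best[0], verif_best[1],
                          prev_best[0], prev_best[1])
    speed = hypot(mx, my)
    if speed == 0.0:
        raise ValueError("storm motion is undefined for a stationary track")
    # Forecast position relative to the verifying position.
    fx, fy = to_local_nmi(forecast[0], forecast[1],
                          verif_best[0], verif_best[1])
    along = (fx * mx + fy * my) / speed   # projection onto the heading
    cross = (fx * my - fy * mx) / speed   # component to the right of heading
    total = hypot(fx, fy)                 # equals (C^2 + A^2)^(1/2)
    return along, cross, total
```

For a storm moving due north, a forecast position that is north and east of the verifying position gives positive along- and cross-track errors, which matches the example described for Figure 15.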



Figure 15. Definition of cross-track forecast error (XTE), along-track forecast error 
(ATE) and forecast track error (FTE). In this example, the forecast position 
is ahead of and to the right of the verifying best track position. Therefore, 
the XTE is positive (to the right of the best track) and the ATE is positive 
(ahead or faster than the best track) (from NPMOC 
http://www.npmoc.navy.mil/itwc/atcr/1998atcr/ch5/chap5 pagel.html .) 

2. Ensemble Spread and Forecast Skill Relationship 

Forecast skill is a simple measure of how different a given forecast is from the 
actual conditions. Given an ensemble of forecasts, the greater the spread among the 
members, the less confidence the forecaster may have in any specific member or in the 
mean of all the members. On the other hand, the better the agreement among the 
individual members, the higher confidence the forecaster is likely to have that the 
ensemble-mean forecast will adequately represent the true state of the atmosphere. 

However, the correlation between ensemble spread and track forecast skill is not 
always so clear. For example, Goerss (2000) found no clear correlation between 
ensemble spread and ensemble mean error for an ensemble track prediction system, 
but did find a relationship between the ensemble spread and the upper bound 
of error. Elsberry and Carr (2000) investigated western North Pacific TCs using a 
consensus of five dynamical model tracks and found that spread was a good indicator of 
track forecast error for low spread values. However, up to 8% of the cases with a small 
consensus spread had large track forecast errors. A selective consensus, in which a 
forecaster would eliminate only the model track with the largest forecast error from the 
five models, may add value and reduce track forecast errors. Grimit and Mass (2006) 
stated that ensemble spread should not simply be used to predict the forecast error of a 
single ensemble realization. Rather, they suggest that ensemble spread should be related 
to the distribution of ensemble-mean forecast errors over a large number of 
realizations. 

In this thesis, the correlation between track forecast error and the spread in the CONU 
members and in the GFS ensemble members will be investigated. 

3. Measuring Model Spread 

As mentioned previously, the spread in the CONU model is one of the factors 
used to define the GPCE value. Therefore, the GPCE value is used to represent model 
spread for the purpose of partitioning forecast errors. When the GFS ensemble is used to 
measure forecast confidence, the spread among the ensemble members is defined as the 
average separation distance with respect to the ensemble mean. 
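
The GFS spread definition above can be sketched as follows. The thesis does not state how the ensemble-mean position or the separation distances were computed, so the arithmetic mean of latitude and longitude, the haversine formula, and the function names are assumptions made for illustration.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import mean

EARTH_RADIUS_NMI = 3440.065  # mean Earth radius expressed in nautical miles


def great_circle_nmi(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in nautical miles."""
    p1, p2 = radians(lat1), radians(lat2)
    dlat = p2 - p1
    dlon = radians(lon2 - lon1)
    h = sin(dlat / 2) ** 2 + cos(p1) * cos(p2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_NMI * asin(sqrt(h))


def ensemble_spread_nmi(member_positions):
    """Average distance (n mi) of the members from the ensemble mean.

    member_positions is a list of (lat, lon) pairs in degrees.
    """
    mean_lat = mean(lat for lat, _ in member_positions)
    mean_lon = mean(lon for _, lon in member_positions)
    return mean(great_circle_nmi(lat, lon, mean_lat, mean_lon)
                for lat, lon in member_positions)
```

Two members that are two degrees of latitude apart on the same meridian are each about 60 n mi from their mean position, so the spread is about 60 n mi.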

III. METHODOLOGY 


A. DATA 

1. Data Source 

This thesis examines data from the 2005 Atlantic hurricane season. Although this is 
only one season, no other season has come close to its amount of activity or the amount 
of data collected. The 2005 Atlantic hurricane season included as many storms as two to 
three past seasons combined, and the storms were well distributed across a wide geographic 
region. With 28 named tropical storms and an additional three unnamed tropical 
depressions, 648 official forecasts were issued by the NHC. Considering there are seven 
time periods for each forecast (12, 24, 36, 48, 72, 96, and 120 h), that means there was a 
potential for 4130 track forecast errors in 2005. Since a verifying position was not 
always available due to the TC hitting land or TC decay, the number of track forecast 
errors available was somewhat less. Since the 2005 Atlantic hurricane season produced 
such a large data set, it is expected to be large enough to test the hypothesis of this thesis. 

2. Data Format 

The data available from the NHC included every official forecast, a majority of 
the model forecasts, and all of the best-track positions of the TCs. The so-called A- 
Decks, B-Decks, and E-Decks from the Automated Tropical Cyclone Forecast (ATCF) 
system were used in this thesis. The A-Decks comprise all of the model and ensemble 
forecasts available to the NHC during the season along with the OFCL forecasts. 
The information included in these files is the storm number, model, forecast time and 
period, intensity, and forecast position in longitude and latitude. 

The B-Decks are the best-track positions of the Atlantic TCs in 2005. As 
previously mentioned, the best-track position is the verifying position of the TC after all 
the information has been evaluated in post-storm analysis. Included in these files are the 
storm number, date and time, intensity, and verifying location in longitude and latitude. 
Thus, the A-Decks can be used in conjunction with the B-Decks to find track forecast 
error. 
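
The matching of forecasts to verifying positions can be sketched as below. The record layout (dictionary keys, a lookup keyed on storm and valid time) is a hypothetical simplification and is not the actual ATCF file format; the haversine distance is likewise an assumed choice.

```python
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_NMI = 3440.065  # mean Earth radius expressed in nautical miles


def great_circle_nmi(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in nautical miles."""
    p1, p2 = radians(lat1), radians(lat2)
    h = (sin((p2 - p1) / 2) ** 2
         + cos(p1) * cos(p2) * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_NMI * asin(sqrt(h))


def track_forecast_errors(forecasts, best_track):
    """Pair each forecast with its verifying position and return the errors.

    forecasts: list of dicts with keys "storm", "init" (datetime),
               "tau_h" (forecast interval in hours), "lat", "lon".
    best_track: dict mapping (storm, valid datetime) to (lat, lon).
    Forecasts without a verifying best-track position are skipped.
    """
    errors = []
    for fc in forecasts:
        valid_time = fc["init"] + timedelta(hours=fc["tau_h"])
        verifying = best_track.get((fc["storm"], valid_time))
        if verifying is None:
            continue  # TC had dissipated or no best-track fix exists
        error = great_circle_nmi(fc["lat"], fc["lon"],
                                 verifying[0], verifying[1])
        errors.append((fc["storm"], fc["tau_h"], error))
    return errors


# Made-up positions for a hypothetical storm, for illustration only.
example_best = {("STORM_A", datetime(2005, 8, 28, 12)): (26.0, -88.0)}
example_forecasts = [
    {"storm": "STORM_A", "init": datetime(2005, 8, 27, 12),
     "tau_h": 24, "lat": 27.0, "lon": -88.0},
    {"storm": "STORM_A", "init": datetime(2005, 8, 27, 12),
     "tau_h": 120, "lat": 35.0, "lon": -85.0},
]
example_errors = track_forecast_errors(example_forecasts, example_best)
```

Only the 24-h forecast in the example has a verifying position, which mirrors the reason given earlier for the number of available errors being smaller than the number of forecasts issued.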

The E-Decks contain the calculated GPCE values for the 2005 season. These files 
include the storm number, date and time, verifying storm location, and the calculated 
GPCE value. The E-Decks were used in conjunction with the A- and B-Decks to match 
the calculated GPCE value to the associated actual track forecast error. 

B. STATISTICAL METHODS OF ANALYSIS 

1. Testing for Differences in Mean 

Consider three pairs of distributions that have the same means (Figure 16), but 
that differ significantly in variability among the three cases. The two distributions in the 
low variability case have very little overlap and thus are significantly different. Although 
the medium variability case has some overlap, the two distributions are distinctly different. 
In the high variability case, it becomes more difficult to distinguish the two populations. 

In this thesis, the first goal is to demonstrate that the track forecast error 
distributions are significantly different when separated into the low or medium variability 
case demonstrated in Figure 16. If there is little difference between them as 
demonstrated by the high variability case in Figure 16, then most likely there will be little 
improvement in modifying the MC model for different variabilities since the distributions 
will be so similar that using them independently will not change the probability output. 



Figure 16. Three pairs of distributions with the same mean (from Trochim, 
http://www.socialresearchmethods.net/kb). 

An objective method is used to determine if two populations are significantly 
different. The t-Test statistic (T) is a function of the difference between the two sample 
means, and takes into account the size and variance of the distributions, 

$$T = \frac{\bar{X}_1 - \bar{X}_2 - \mu_0}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \qquad (2)$$

where $\bar{X}_1$ and $\bar{X}_2$ are the means of the two samples, $\mu_0 = \mu_1 - \mu_2 = 0$ is the 
hypothesized difference between the two means, $S_1$ and $S_2$ are the standard deviations of 
the two samples, and $n_1$ and $n_2$ are the numbers of members in each sample. For this 
thesis, the t-statistic is evaluated using a 95% confidence level, which means that a true 
null hypothesis will be rejected in error no more than five out of 100 times. 

The null hypothesis for this test is that the two means are the same ($\mu_1 - \mu_2 = 0$) 
(Figure 17). If the t-statistic falls in the acceptance region of the t-distribution 
(t-statistic < t-critical), the null hypothesis is not rejected. If the t-statistic falls in the 
critical region of the t-distribution (t-statistic > t-critical), the null hypothesis is rejected 
and the two means are considered significantly different. 
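
The test statistic in Equation (2) can be computed directly from two samples of track errors, as in the sketch below. One simplification is an assumption of this example and not part of the thesis method: the critical value is taken from the standard normal distribution, which approximates the t-distribution only for large samples, and the comparison is two-sided. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def t_statistic(sample1, sample2, mu0=0.0):
    """Equation (2): difference in sample means scaled by its standard error."""
    n1, n2 = len(sample1), len(sample2)
    standard_error = sqrt(variance(sample1) / n1 + variance(sample2) / n2)
    return (mean(sample1) - mean(sample2) - mu0) / standard_error


def means_differ(sample1, sample2, confidence=0.95):
    """True if the null hypothesis of equal means is rejected.

    Assumption: a large-sample normal approximation supplies the critical
    value; an exact test would use the t-distribution with the appropriate
    degrees of freedom.
    """
    critical = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return abs(t_statistic(sample1, sample2)) > critical
```

In the nomenclature adopted later in this thesis, a tercile comparison "passes" when means_differ returns True.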



Figure 17. Hypothesis test for differences in mean (from Wadsworth, 
http://www.wadsworth.com/psychology_d/templates/student_resources/wor 
kshops/stat_workshp/ttest_betwn/ttest_betwn_02.html). 

As indicated in Figure 16, the means of two samples may be the same while the 
variances are different. This situation may apply to cross- and along-track forecast 
errors since the values are both positive and negative. These errors tend to cancel at 
times, which leads to very little difference between the means of the distributions. 
However, the means of the distributions can be used as an indicator of the skewness. 
Distributions that have the same mean but different variances are still separable, 
and using them separately may also improve the MC method. 

2. Testing for Differences in Variance 

The test for differences in variances simply takes the ratio of the larger sample 
variance to the smaller sample variance and compares it to an F-distribution at a given 
confidence level and degrees of freedom. As with the t-tests, the confidence level in this 
thesis will be 95%. The equation for the F-statistic is 

$$F = \frac{S_1^2}{S_2^2} \qquad (3)$$

where $S_1^2$ is the larger variance. 

The null hypothesis is that the two variances are equal, so if the F-statistic is 
smaller than the F-critical value, it lies in the acceptance region and the null hypothesis is 
not rejected. If the F-statistic is larger than the F-critical value, then it is in the critical 
region and the null hypothesis is rejected. In other words, the variances of the two 
samples are significantly different. 
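
A minimal sketch of the variance-ratio test follows. The critical value is passed in by the caller because the Python standard library has no F-distribution; in practice it would come from a statistical table or from a statistics package. The function names are hypothetical.

```python
from statistics import variance


def f_statistic(sample1, sample2):
    """Return (F, numerator dof, denominator dof), larger variance on top."""
    v1, v2 = variance(sample1), variance(sample2)
    if min(v1, v2) == 0:
        raise ValueError("the smaller sample variance is zero")
    if v1 >= v2:
        return v1 / v2, len(sample1) - 1, len(sample2) - 1
    return v2 / v1, len(sample2) - 1, len(sample1) - 1


def variances_differ(sample1, sample2, f_critical):
    """True if the null hypothesis of equal variances is rejected.

    f_critical must be looked up for the chosen confidence level and the
    degrees of freedom returned by f_statistic.
    """
    f_value, _, _ = f_statistic(sample1, sample2)
    return f_value > f_critical
```

Because the larger variance is always placed in the numerator, the statistic is never less than one and the order of the two samples does not matter.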

This simple test for determining the differences in variance is an important piece 
of this thesis. If the Monte Carlo model were to draw from two distributions that had the 
same mean but different variances, then there may still be a significant improvement to 
the model. 


3. Histograms 

Some of the properties that can be determined from a histogram include the 
median of the data, spread of the data, skewness of the data, presence of outliers, and the 
presence of multiple modes. These properties can be a good indicator of a proper 
distribution or a random distribution. To create a histogram, the values are separated into 
bins of a predetermined size. The number of values in each bin determines the 
frequency. The shape of the resulting histogram can give information about the skewness 
or variance of the distributions. In this thesis, histograms are used to compare different 
levels of forecast confidence for each forecast interval. 
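
The binning step can be sketched as follows. The bin width is a free choice of the analyst; the function name is hypothetical and the values in the usage note are made up for illustration.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall in each bin of the given width.

    Returns a dict mapping the lower edge of each non-empty bin to its
    frequency. Floor division also places negative values (for example,
    along- or cross-track errors) in the correct bin.
    """
    counts = Counter(int(v // bin_width) for v in values)
    return {index * bin_width: counts[index] for index in sorted(counts)}
```

For example, histogram([5, 12, 18, 25, 31, 38], 10) places one value in the 0 to 10 bin, two in the 10 to 20 bin, one in the 20 to 30 bin, and two in the 30 to 40 bin.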

4. Linear Regressions and Correlations 

Linear regression is a classical statistical method to find the relationship between 
the predictand (Y; dependent variable) and the predictor (X; independent variable). In 
this thesis, linear regression is used to examine the correlation between forecast error (Y; 
dependent variable) and GPCE value (X; independent variable). Three values are used in 
this thesis to illustrate the relationship: the multiple R, which represents the correlation; 
the adjusted R², which is the amount of the variance in forecast error that can be 
explained by the GPCE value; and the P-value, which represents the statistical significance 
of the R² value. 
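
A least-squares fit with a single predictor can be sketched as below. The sketch returns the slope, intercept, R, R², and adjusted R²; the P-value is omitted because it requires the t-distribution, which is not in the Python standard library. The function name is hypothetical.

```python
from math import sqrt
from statistics import mean


def linear_fit(x, y):
    """Least-squares regression of y (forecast error) on x (GPCE value).

    Returns (slope, intercept, r, r_squared, adjusted_r_squared).
    """
    n = len(x)
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    r_squared = r * r
    # Adjustment for one predictor: n - 2 residual degrees of freedom.
    adjusted = 1.0 - (1.0 - r_squared) * (n - 1) / (n - 2)
    return slope, intercept, r, r_squared, adjusted
```

With one predictor, the multiple R is the absolute value of the ordinary correlation coefficient between the two variables.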

A perfect correlation is not necessary when separating the distributions for the 
MC model. If the GPCE was perfectly correlated with the track forecast error, no 
probabilistic model would be needed. The expected distributions should contain some 
large track forecast errors for low GPCE values and small track forecast errors for high 
GPCE values to match the 75% hit rate of the GPCE radius (Figure 13). 

IV. ANALYSIS AND RESULTS 


A. INTRODUCTION 

The goal of this thesis is to demonstrate the likely benefit of conditioning the 
track error distributions from which the MC model draws based on forecast confidence 
rather than using a single track error distribution for all forecasts. Forecast confidence is 
measured by the predicted error (GPCE) in the consensus track forecasts. If the forecast 
confidence is high, then it may be beneficial for the probabilistic model to draw from 
historic track errors that were produced in the past under similar forecast confidence 
conditions. The same logic would apply for low forecast confidence cases. The thesis 
goal is to determine whether these distributions are significantly different such that 
changing the MC model would be beneficial. 

Five 2005 Atlantic hurricane season track error distributions are tested to see if 
they are separable in means and variance: OFCL total-track error and the corresponding 
GPCE value; OFCL along-track error and the corresponding GPCE value; OFCL 
cross-track error and the corresponding GPCE value; OFCL total-track error and the 
corresponding NCEP Global Forecast System (GFS) ensemble spread; and the GFS 
ensemble mean total-track error and the corresponding GFS ensemble spread. For each 
forecast interval (12, 24, 36, 48, 72, 96, and 120 h), the track-error distributions and their 
corresponding measures of forecast confidence are evaluated independently. 

Although the MC model draws from distributions of along- and cross-track errors, 
the total-track errors are examined first. If the total-track error distributions are not found 
to be significantly different when conditioned on forecast confidence, then there is no 
point to investigate further using the along- and cross-track components. Conversely, if 
the total-track error distributions are found to be significantly different, then an 
examination of the along- and cross-track errors is warranted. 

The distributions of track errors were binned into terciles conditioned on forecast 
confidence values that are available to the forecaster when the forecast is made (i.e., 
GPCE value or GFS ensemble spread). The resulting track-error distributions were 
compared using several statistical methods: differences in means and variances (Sections 
IIIB.1 and 2); histogram evaluation (Section IIIB.3); and linear regression (Section 
IIIB.4). However, linear regression and the differences in means were not used for the 
cross- and along-track errors, since the positive and negative values led to means near 
zero. The key statistical test for those distributions is the difference in variances, since it 
is the variance that impacts the probabilistic output of the MC model when it draws from 
the distributions of along- and cross-track errors. 

Although terciles usually consist of three equal-sized distributions, the 
conditioned distributions for a majority of the forecast intervals in this thesis do not 
always consist of equal numbers of samples. Different sizes arise because the samples in 
the distributions (track forecast errors) were not the values used to calculate the tercile 
levels, but rather they correspond to the samples of GPCE values or GFS ensemble 
spreads. For OFCL track forecast errors conditioned on GPCE values, the distributions 
of the lower tercile (high forecast confidence) contain larger numbers than the other 
terciles for most of the forecast intervals. Many times when a GPCE value was 
calculated, especially for the longer forecast intervals, a corresponding OFCL track 
forecast error did not exist because either the TC no longer existed or became 
extratropical, so a verifying position could not be established. The reason the numbers of 
OFCL forecast track errors decrease significantly with large GPCE values at the longer 
forecast intervals is that many of the large GPCE values were calculated toward the end 
of the TC life cycle when the models usually have less skill. In other words, toward the 
end of a TC life cycle, the CONU ensemble members had a large spread, so the resulting 
GPCE value was large. Many of those large GPCE values did not have a corresponding 
track forecast error because the TC no longer existed at the verification time for the 
forecast. 
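
The reason for the unequal tercile sizes can be seen in a short sketch. The tercile boundaries are computed from all of the GPCE values, but only the cases that have a verifying track forecast error are placed in the bins. The boundary rule (sorted values at the one-third and two-thirds positions) and the function names are assumptions made for illustration, since the exact rule used in the thesis is not stated.

```python
def tercile_thresholds(values):
    """Return the two boundaries that split the sorted values into thirds."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[n // 3], ordered[(2 * n) // 3]


def bin_errors_by_confidence(cases):
    """Group track errors by the tercile of their GPCE value.

    cases is a list of (gpce_nmi, track_error_nmi) pairs, with the error
    set to None when no verifying position exists. Small GPCE values
    correspond to high forecast confidence.
    """
    lower, upper = tercile_thresholds([gpce for gpce, _ in cases])
    bins = {"high": [], "average": [], "low": []}
    for gpce, error in cases:
        if error is None:
            continue  # GPCE was calculated but the forecast never verified
        if gpce < lower:
            bins["high"].append(error)
        elif gpce < upper:
            bins["average"].append(error)
        else:
            bins["low"].append(error)
    return bins
```

If the missing errors are concentrated among the largest GPCE values, the low-confidence bin ends up with fewer samples than the other two, which is the behavior described above for the longer forecast intervals.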

For the OFCL and GFS ensemble mean total-track forecast errors conditioned on 
GFS ensemble spreads, terciles at each forecast interval do have roughly equal-sized 
distributions, which is in part due to the low number of samples from the GFS ensemble. 
The samples that were included in the A-Decks usually had a corresponding OFCL track 
forecast error. 

Some of the nomenclature used in this chapter and the next may seem 
contradictory. Although the null hypothesis for the tests for different means and variances 
is that they are equal, at times the term “fail” will be used when the null hypothesis of zero 
difference in means or variances is accepted, because the goal is to separate the total 
distribution into smaller ones with different means and/or variances. Similarly, “pass” 
will be used when this null hypothesis is rejected. 

B. OFFICIAL TOTAL-TRACK FORECAST ERRORS CONDITIONED ON GPCE VALUE 

The OFCL total-track error is the distance between the OFCL forecast position 
and the verifying best-track position (Figure 15). These track errors were binned into three 
distributions based on the tercile of the corresponding GPCE value. Since the GPCE value 
is directly correlated with the spread among members of the CONU ensemble, the GPCE 
values were divided into terciles to define low, average, and high forecast confidence that 
correspond to high, average, and low CONU ensemble spreads. 

The results of the comparisons are summarized in Table 1. In this table, the 
characteristics of the distribution of OFCL total-track forecast errors and of each tercile 
distribution for each forecast interval are defined in terms of the sample size, mean (n 
mi), and standard deviation (n mi). Based on the linear regression of the OFCL 
total-track forecast errors on the GPCE values, the correlation (R), the amount of variance 
in the OFCL total-track forecast errors explained by the GPCE values (R²), and the 
statistical significance (P-value) are defined in the table for each forecast interval. 

Tercile track forecast error histograms (Figures 18 through 21) are compared to 
examine changes in skewness from high forecast confidence to low. The hypothesis is 
that for high forecast confidence the distributions (Figures 18a-21a) should be skewed to 
the right, which indicates lower forecast track errors. Next, average forecast confidence 
distributions (Figures 18b-21b) should be distributed about the mean to indicate a high 
number of mid-range track errors. Finally, low forecast confidence distributions (Figures 
18c-21c) should be skewed more to the left to indicate a high number of large track 
errors. 

Table 1. The tercile comparison table for the OFCL total-track forecast errors 
conditioned on GPCE values. The legend in the upper right portion of the 
table defines the color scheme and tercile definitions. 

1. Analysis and Results 

For the 12-h forecast interval tercile comparisons, the tests for differences in 
means and variances demonstrate that all three distributions are significantly different 
(Table 1). However, the difference in means between the middle and upper terciles was 
about half of the difference in means between the lower and middle, which may be a 
result of the short time frame of the forecast (12 h). Such a short forecast interval limits 
how far the forecast position can be off the verifying track. In other words, large 
differences in speed can result in relatively small differences in distance traveled. So 
even if the OFCL forecast TC speed is in error, a relatively short total-track forecast error 
may result. Consequently, the middle- and upper-tercile distributions look similar 
(Figure 18, 12-h panels b and c). Both distributions range from 10 to 120 n mi (not 
including outliers) with similar frequencies in each bin. Additionally, the track errors 
based on average forecast confidence (panel b) drop from 17 track errors to ten after 
70 n mi, while the track errors based on low forecast confidence (panel c) drop from 16 to 
six after 80 n mi. Thus, after 70 to 80 n mi the few track errors that occurred were widely 
spread. The biggest difference between these two terciles is that the upper tercile has a 
few more outliers above 100 n mi that may account for the comparison passing both tests. 

All three distributions at 12 h are skewed to the right, and the magnitude increases 
with decreasing forecast confidence, which is consistent with the hypothesis given above. 
Not many track errors are expected to skew any of the distributions toward zero values on 
the left. The means and variances for the 12-h forecast interval distributions increase as 
forecast confidence decreases (Table 1). This indicates that the track errors become larger 
and more widely spread as forecast confidence decreases. 

The 24-h forecast interval tercile comparisons passed all tests except for 
difference in variances between the lower and middle tercile (Table 1). Although they 
have significantly different means and different shapes (Figure 18, 24-h panels a and b), 
the standard deviations differ by only 3 n mi. However, the high confidence tercile is 
skewed to the right, while the middle tercile distribution is somewhat centered. This 
difference is expected between high and average forecast confidence. Also both the error 
means and variances increase with decreasing forecast confidence as measured by the 
GPCE values. 

Figure 18. Histograms of 12- (left column) and 24-h (right column) OFCL total-track 
forecast error distributions conditioned on a) high, b) average, and c) low 
forecast confidence. The means are represented by the dashed lines. 


The 36- and 48-h forecast interval tercile comparisons both passed all statistical 
tests as their means and variances increase with decreasing forecast confidence (Table 1). 
In addition, the tercile histograms (Figure 19) show the hypothesized progression in 
skewness between high, average, and low forecast confidence. 

Figure 19. Histograms of 36- (left column) and 48-h (right column) OFCL total-track 
forecast error distributions conditioned on a) high, b) average, and c) low 
forecast confidence. The means are represented by the dashed lines. 


Both the lower- and middle-tercile and the lower- and upper-tercile comparisons 
passed the statistical tests for the 72- and 96-h forecast intervals (Table 1). However, the 
middle- and upper-tercile comparisons did not. Despite this, both the 72- and 96-h 
forecast intervals had increases in the means and variances with decreasing forecast 
confidence along with the hypothesized progression of skewness (Figure 20). 

Figure 20. Histograms of 72- (left column) and 96-h (right column) OFCL total-track 
forecast error distributions conditioned on a) high, b) average, and c) low 
forecast confidence. The means are represented by the dashed lines. 


For the 72-h forecast middle- and upper-tercile comparison, only a 19 n mi 
difference in means occurred, while the lower- and middle-terciles had a 48 n mi 
difference. The corresponding difference in standard deviations was only 15 n mi, while 
the lower- and middle-tercile comparison had a difference of 41 n mi (Table 1). These 
similarities can be seen in the widths of the distributions in Figure 20, 72-h panels b and c. Despite the 
failed tests, the distributions still have the hypothesized progression in skewness. 

The statistical tests for the 96-h forecast interval middle- and upper-tercile 
comparisons both resulted in P-values of 0.05. In other words, if this were evaluated 
using a 96% confidence level instead of 95%, they would both fail. 

Both the middle- and upper-tercile and the lower- and upper-tercile comparisons passed 
the statistical tests for the 120-h forecast interval (Table 1). However, the lower- and 
middle-tercile comparisons failed both tests, although the test for differences in variances 
would pass at a 94% confidence level. 

The reasoning for the poor results of the lower- and middle-tercile comparisons 
is analogous to the reasoning given for the 0.03 P-value of the middle- and upper-tercile 
test for differences in means for the 12-h forecast interval. Just as a short time frame 
limits the distances the TCs travel, the long time integration to 120 h means that even 
slow moving TCs have traveled a large distance. Consequently, small differences in 
speed result in large differences in distance. Therefore, the potential for larger track 
errors from missed forecasts is greater. Even if the OFCL forecast missed the TC speed 
by a small margin, a relatively large track error will result. Consequently, the lower- and 
middle-tercile distributions are similar (Figure 21). Both distributions have the same 
range (40 n mi to 680 n mi) not including outliers, and both have similar shapes. 
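
The growth of track error with forecast interval can be made concrete with simple
arithmetic. The sketch below assumes a constant forecast speed error, and the 2-kt
value is purely illustrative; neither assumption comes from the thesis data.

```python
def track_error_from_speed_error(speed_error_kt, forecast_hours):
    """Displacement error (n mi) produced by a constant speed error (kt)."""
    return speed_error_kt * forecast_hours


# Illustrative 2-kt speed error at three forecast intervals.
for hours in (12, 72, 120):
    error = track_error_from_speed_error(2.0, hours)
    print(f"{hours:>3} h: {error:.0f} n mi")
```

Because the displacement scales linearly with time, the same speed error yields a
track error ten times larger at 120 h than at 12 h.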

The lower and middle tercile distributions are both skewed to the right (Figure
21). The upper tercile skewness is not easily discernible due to the small number of
samples. Although the mean and variance increase as forecast confidence decreases
(Table 1), the skewness of the distributions does not follow the hypothesized progression
from high to low forecast confidence.

Figure 21. Histograms of 120-h OFCL total-track forecast error distributions
conditioned on a) high, b) average, and c) low forecast confidence. The
means are represented by the dashed lines.


2. Summary 

The tests for differences in OFCL forecast error means resulted in 19 of the 21
tercile comparisons having significantly different means when conditioned by the GPCE
values. Similarly, the tests for differences in variances showed that 18 of the 21 tercile
comparisons had significantly different variances.

The correlations between GPCE values and OFCL total-track forecast errors
remained somewhat constant throughout the forecast intervals (Table 1). The 12-, 24-,
36-, 48-, and 120-h forecasts all had correlations from 0.37 to 0.41. The only exceptions
were the 72- and 96-h forecasts, which both had a correlation of 0.29. A perfect or
high correlation is not needed, because cases in which the track error is uncorrelated
with the GPCE value also need to be represented in the probabilistic model, so these
results are consistent with expectations. These positive correlations reflect that the
means and variances increased with decreasing forecast confidence for every forecast
interval.

The skewness of the different terciles followed the hypothesized progression from 
right to left with decreasing forecast confidence for all forecast intervals except for 12 h 
and 120 h (Figures 18 through 21). This indicates that high forecast confidence cases are 
associated with more small track errors, average forecast confidence cases have more 
average track errors, and low forecast confidence cases are associated with more large
track errors. 
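
The direction of skewness discussed above can be quantified with a moment-based
skewness coefficient. The following is a minimal sketch and not the procedure used in
this thesis, where skewness is discussed with reference to the histograms; the function
name and the sample values are hypothetical.

```python
def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (positive means right-skewed)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical track errors (n mi): mostly small errors with one large error.
high_confidence = [10, 15, 20, 25, 30, 40, 120]
# Hypothetical mirror case: mostly large errors with one small error.
low_confidence = [130 - e for e in high_confidence]

print(skewness(high_confidence) > 0)  # tail extends to the right
print(skewness(low_confidence) < 0)   # tail extends to the left
```

A positive coefficient corresponds to a distribution dominated by small track errors,
and a negative coefficient to one dominated by large track errors.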

All of these factors combined indicate that the GPCE value is a good indicator of
forecast confidence for the OFCL total-track forecasts. Using tercile GPCE values to
condition the total-track forecast errors will most likely result in three
significantly different distributions. Consequently, these total-track results warrant an
examination of the along- and cross-track OFCL forecast errors to see if they also can be 
separated into significantly different distributions. If so, then those distributions may 
improve the probabilistic output of the MC model. 

C. OFFICIAL ALONG-TRACK FORECAST ERRORS CONDITIONED ON GPCE VALUE

The OFCL along-track error is a component of the total-track error and can be 
used as a measure of whether the forecast is fast or slow (Figure 15). The errors can be 
positive (which represents a forecast that was fast) or negative (which represents a 
forecast that was slow). Using the same method as the previous section, the OFCL 
along-track forecast errors were binned into three distributions based on the 
corresponding tercile GPCE value. 
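
The binning step described above can be sketched as follows. This is an illustration
only: how the tercile boundaries were obtained is not restated in this section, so
computing them from the sample itself is an assumption, and the function name and data
values are hypothetical.

```python
from statistics import quantiles


def bin_by_gpce_tercile(gpce_values, track_errors):
    """Group track errors by the tercile of the paired GPCE value."""
    # quantiles(..., n=3) returns the two cut points that split the data in thirds.
    lower_cut, upper_cut = quantiles(gpce_values, n=3)
    bins = {"lower": [], "middle": [], "upper": []}
    for gpce, error in zip(gpce_values, track_errors):
        if gpce <= lower_cut:
            bins["lower"].append(error)    # small GPCE: high forecast confidence
        elif gpce <= upper_cut:
            bins["middle"].append(error)   # average forecast confidence
        else:
            bins["upper"].append(error)    # large GPCE: low forecast confidence
    return bins


# Hypothetical GPCE values (n mi) and paired along-track errors (n mi).
gpce = [20, 25, 30, 40, 45, 50, 70, 80, 90]
errors = [-5, 3, -8, -20, 10, -15, -60, 35, -40]
for name, values in bin_by_gpce_tercile(gpce, errors).items():
    print(name, values)
```

Each forecast error keeps its sign, so the resulting distributions can be examined for
both bias (mean) and spread (variance).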

The results of the comparison are summarized in Table 2, and the tercile
comparison histograms are displayed in Figures 22 through 25. Linear regression was
not performed because the positive and negative values would be compared with only
positive GPCE values. Since along-track error is a component of the total-track error, the
linear regressions from the previous section are still representative for each forecast
interval. Although the tests for differences in means are displayed in Table 2, the results
are not as important since the negative and positive values tend to cancel. Therefore, the
values are not highlighted as pass or fail. However, the means themselves will be used to
determine whether a bias exists in the OFCL along-track forecasts. The key statistical 
test for these distributions is the differences in variances since it is the variance that 
impacts the probabilistic output of the MC model when it draws from the along-track 
forecast error distributions. 
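
The two test statistics reported for each tercile comparison can be approximated from
the summary statistics alone. The sketch below assumes an unequal-variance (Welch) form
of the t statistic and an F statistic formed as the ratio of the larger to the smaller
sample variance; the exact test variants used in the thesis are not restated in this
section, and the function names are hypothetical.

```python
import math


def t_stat(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Absolute Welch t statistic for a difference in means."""
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return abs(mean_a - mean_b) / standard_error


def f_stat(sd_a, sd_b):
    """F statistic as the ratio of the larger to the smaller variance."""
    larger, smaller = max(sd_a, sd_b), min(sd_a, sd_b)
    return (larger / smaller) ** 2


# 12-h along-track lower and middle terciles (means, SDs, samples from Table 2).
t = t_stat(-0.5, 23.4, 159, -11.1, 32.1, 164)
f = f_stat(23.4, 32.1)
print(f"t-Stat: {t:.2f}  F-Stat: {f:.2f}")
```

With the rounded table inputs, both statistics land within about 0.01 of the tabulated
12-h values (t-Stat 3.40, F-Stat 1.89); small differences are expected because the
table entries are themselves rounded.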

Table 2. The tercile comparison table for the OFCL along-track forecast errors
conditioned on GPCE values. The legend following the table defines the color
scheme and tercile definitions.

OFCL Along-Track Forecast Errors Conditioned on GPCE Value

12-H Forecast
                                    Lower       Middle      Upper       Total
Samples                             159         164         166         489
Mean (n mi)                         -0.5        -11.1       -18.4       -10.2
Standard Deviation (n mi)           23.4        32.1        38.5        32.8

Comparison                          L vs. M     M vs. U     L vs. U
Test for differences in means
  t-Stat                            3.40        1.86        5.08
  P                                 0.00        0.06        0.00
Test for differences in variances
  F-Stat                            1.89        1.44        2.72
  P                                 0.00        0.01        0.00

24-H Forecast
                                    Lower       Middle      Upper       Total
Samples                             172         152         142         466
Mean (n mi)                         -2.3        -13.5       -32         -15
Standard Deviation (n mi)           41.1        50          58.2        51

Comparison                          L vs. M     M vs. U     L vs. U
Test for differences in means
  t-Stat                            2.19        2.92        5.12
  P                                 0.03        0.00        0.00
Test for differences in variances
  F-Stat                            1.47        1.36        2.00
  P                                 0.01        0.03        0.00

36-H Forecast
                                    Lower       Middle      Upper       Total
Samples                             167         153         112         432
Mean (n mi)                         -7.2        -12.2       -36.1       -16.5
Standard Deviation (n mi)           59.6        71.6        73.4        68.7

Comparison                          L vs. M     M vs. U     L vs. U
Test for differences in means
  t-Stat                            0.68        2.64        3.46
  P                                 0.50        0.01        0.00
Test for differences in variances
  F-Stat                            1.44        [illegible] 1.53
  P                                 0.01        [illegible] 0.01

48-H Forecast
                                    Lower       Middle      Upper       Total
Samples                             150         149         91          390
Mean (n mi)                         -4.4        -11.8       -33.2       -14
Standard Deviation (n mi)           68.3        89.2        122         91.4

Comparison                          L vs. M     M vs. U     L vs. U
Test for differences in means
  t-Stat                            0.80        1.46        2.08
  P                                 0.42        0.15        0.04
Test for differences in variances
  F-Stat                            1.71        1.86        3.17
  P                                 0.00        0.00        0.00

72-H Forecast
                                    Lower       Middle      Upper       Total
Samples                             152         104         57          313
Mean (n mi)                         -8.6        -60.9       -26         -29.1
Standard Deviation (n mi)           108         148         150         132

Comparison                          L vs. M     M vs. U     L vs. U
Test for differences in means
  t-Stat                            3.08        1.42        0.80
  P                                 0.00        0.16        0.43
Test for differences in variances
  F-Stat                            1.87        [illegible] 1.91
  P                                 0.00        [illegible] 0.00

96-H Forecast
                                    Lower       Middle      Upper       Total
Samples                             110         80          51          241
Mean (n mi)                         -18.7       -80.9       -76.4       -51.6
Standard Deviation (n mi)           117         227         307         209

Comparison                          L vs. M     M vs. U     L vs. U
Test for differences in means
  t-Stat                            2.24        0.09        1.30
  P                                 0.03        0.93        0.20
Test for differences in variances
  F-Stat                            3.75        1.82        6.85
  P                                 0.00        0.01        [illegible]

120-H Forecast
                                    Lower       Middle      Upper       Total
Samples                             88          55          45          188
Mean (n mi)                         -41.7       -75         -41         -51.3
Standard Deviation (n mi)           225         261         384         280

Comparison                          L vs. M     M vs. U     L vs. U
Test for differences in means
  t-Stat                            0.78        0.51        0.01
  P                                 0.44        0.61        0.99
Test for differences in variances
  F-Stat                            [illegible] 2.17        2.91
  P                                 [illegible] [illegible] 0.00

Legend
Color scheme (shading not reproduced): Statistically Different (at 0.05 alpha);
Within 1% C.L. of Pass/Fail
Lower (L) - Lower tercile of track forecast errors when forecast confidence was high
Middle (M) - Middle tercile of track forecast errors when forecast confidence was average
Upper (U) - Upper tercile of track forecast errors when forecast confidence was low

1. Analysis and Results

The 12- and 24-h forecast interval tercile comparisons both passed the test for 
differences in variances (Table 2). In addition, the variances increased as forecast 
confidence decreased. However, the middle and upper terciles were more similar than the
lower and middle, which again may be due to the short forecast intervals as discussed in
the previous section. All three tercile along-track error distributions for both of these
forecast intervals have a negative bias (Table 2). Interestingly, the upper limit of positive
along-track errors does not increase as forecast confidence decreases for either forecast
interval (Figure 22). Rather, the negative along-track errors increase in magnitude as
forecast confidence decreases.

The 36-h forecast tercile comparisons all passed the test for differences in variances 
except for the middle- and upper-terciles comparison (Table 2). The difference in 
standard deviations between the two is only 1.8 n mi compared to 12 n mi between the 
lower and middle terciles. However, the variance does increase with decreasing forecast 
confidence for this forecast interval. All three terciles for the 36-h forecast interval have 
a negative bias that increases with decreasing forecast confidence (Table 2). Although 
the middle and upper terciles statistically have the same variance, the upper tercile is 
shifted slightly farther to the left than the middle (Figure 23, 36-h panels b and c). This 
shift indicates that the largest along-track errors are negative in value, especially when 
forecast confidence is low. 

For the 48-h forecast interval tercile comparisons, all passed the test for differences 
in variances (Table 2). All three terciles have negative biases that, along with variances, 
increase as forecast confidence decreases (Figure 23, 48-h). 

The 72-h forecast interval tercile comparisons had one failure in the tests for 
differences in variances, which is the middle- and upper-terciles comparison (Table 2). 
The standard deviation is nearly identical between the two, which is consistent with the 
failed test for the same terciles comparison as for the total-track errors in Table 1. 

Figure 22. Histograms of 12- (left column) and 24-h (right column) OFCL along-track
forecast error distributions conditioned on a) high, b) average, and c) low 
forecast confidence. The means are represented by the dashed lines. 


The 72-, 96-, and 120-h forecast interval terciles all had negative along-track
biases. For all three, the middle tercile representing average forecast confidence had the
largest negative along-track bias. When taking into account the large range of
along-track forecast errors at the longer forecast intervals, the differences become less
significant (Figures 24 and 25). Despite this discrepancy, variance does increase with
decreasing forecast confidence for all three of these forecast intervals.



Figure 23. Histograms of 36- (left column) and 48-h (right column) OFCL along-track 
forecast error distributions conditioned on a) high, b) average, and c) low 
forecast confidence. The means are represented by the dashed lines. 

The 96-h forecast tercile comparisons all passed the tests for differences in
variances, while the 120-h forecast interval had one failure, the lower- and
middle-tercile comparison (Table 2). This failure is for the same tercile comparison as
for the total-track errors (Table 1).




Figure 24. Histograms of 72- (left column) and 96-h (right column) OFCL along-track 
forecast error distributions conditioned on a) high, b) average, and c) low 
forecast confidence. The means are represented by the dashed lines. 

Figure 25. Histograms of 120-h OFCL along-track forecast error distributions
conditioned on a) high, b) average, and c) low forecast confidence. The
means are represented by the dashed lines.

2. Summary

The differences in variances tests for the OFCL along-track errors resulted in 18
of the 21 tercile comparisons having significantly different variances. Two of the failures
were consistent with the total-track failures for the 72-h middle- and upper-tercile and
120-h lower- and middle-tercile comparisons. The failure of the 36-h middle- and
upper-tercile along-track error comparison did not have a corresponding failure with
total-track errors. However, the corresponding total-track comparison did have a P-value
of 0.03. Conversely, the 24-h along-track error lower- and middle-tercile comparison
passed while the corresponding total-track error tercile comparison failed.

Every along-track error tercile had a negative bias for each forecast interval, 
which indicates that the OFCL forecast is consistently slow. If this tendency stays 
consistent from year to year, the probabilistic model will account for this bias in its 
output. 

All forecast intervals had increasing along-track error variances with decreasing
forecast confidence. That is, as forecast confidence lowers, the forecast along-track error
becomes less predictable. If the MC model drew from three different distributions of
along-track errors based on forecast confidence, the area covered by each probability
interval would increase with decreasing forecast confidence.
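
The effect of confidence-conditioned sampling on the spread of simulated tracks can be
sketched with a toy Monte Carlo draw. This is not the NHC MC model: the normal
distribution, the function names, and the 90% interval are assumptions made for
illustration. Only the 48-h along-track means and standard deviations are taken from
Table 2.

```python
import random

# 48-h along-track tercile statistics (n mi) from Table 2.
TERCILE_MEAN = {"high": -4.4, "average": -11.8, "low": -33.2}
TERCILE_SD = {"high": 68.3, "average": 89.2, "low": 122.0}


def sample_along_track_errors(confidence, n_samples, seed=0):
    """Draw along-track errors from a normal distribution (an assumption)."""
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    mean, sd = TERCILE_MEAN[confidence], TERCILE_SD[confidence]
    return [rng.gauss(mean, sd) for _ in range(n_samples)]


def central_interval_width(samples, fraction=0.9):
    """Width of the central interval holding roughly `fraction` of the samples."""
    ordered = sorted(samples)
    tail = (1.0 - fraction) / 2.0
    low = ordered[int(tail * len(ordered))]
    high = ordered[int((1.0 - tail) * len(ordered)) - 1]
    return high - low


for confidence in ("high", "average", "low"):
    draws = sample_along_track_errors(confidence, 1000)
    print(confidence, round(central_interval_width(draws)))
```

The interval widens from the high to the low confidence tercile, which is the behavior
the conditioned distributions are expected to impart to the probability intervals.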

Taking these factors into account along with the results of the total-track error 
comparisons, it is clear that OFCL along-track forecast errors can be successfully 
stratified by forecast confidence based on the GPCE value. The new probabilistic model 
will most likely benefit from adopting this approach. 

D. OFFICIAL CROSS-TRACK FORECAST ERRORS CONDITIONED ON GPCE VALUE

The OFCL cross-track error is a component of the total-track error that can be 
used as a measure of whether the forecast is to the left or right of the verifying position 
(Figure 15). The errors can be positive (which represents a forecast that was to the right)
or negative (which represents a forecast that was to the left). Using the same method as
the previous sections, the OFCL cross-track forecast errors were binned into three 
distributions based on the corresponding tercile GPCE value. 

The results of the comparisons are presented in Table 3, and the tercile
comparison histograms are displayed in Figures 26 through 29. Linear regression was
not performed on the cross-track errors for the same reason it was not performed on the
along-track errors. Although the tests for differences in means are displayed in Table 3,
the results are not as important since the negative and positive values tend to cancel.
Therefore, the values are not highlighted as pass or fail in Table 3. However, the means
will be used to determine whether a bias exists in the OFCL cross-track forecasts. The
key statistical test for these distributions is the differences in variances, since it is the
variance that impacts the probabilistic output of the MC model when it draws from the
cross-track forecast error distributions.

Cross-track errors for the OFCL forecasts are usually smaller in magnitude than
the along-track errors. Therefore, the range of the distributions will be smaller than for
the along-track distributions. Because of this smaller variability, some of the tests for
differences in cross-track error variances have a slightly higher P-value than for the
along-track results.

Table 3. The tercile comparison table for the OFCL cross-track forecast errors
conditioned on GPCE values. The legend following the table defines the color
scheme and tercile definitions.

OFCL Cross-Track Forecast Errors Conditioned on GPCE Value

12-H Forecast
                                    Lower       Middle      Upper       Total
Samples                             157         159         154         470
Mean (n mi)                         -3.1        -1.9        -5.5        -3.5
Standard Deviation (n mi)           18.2        28.3        28          26.2

Comparison                          L vs. M     M vs. U     L vs. U
Test for differences in means
  t-Stat                            0.50        1.14        0.90
  P                                 0.65        0.26        0.37
Test for differences in variances
  F-Stat                            2.41        [illegible] 2.37
  P                                 0.00        [illegible] 0.00

24-H Forecast
                                    Lower       Middle      Upper       Total
Samples                             161         151         140         452
Mean (n mi)                         -10.3       -0.1        -13.4       -12.4
Standard Deviation (n mi)           32.3        48.2        52.5        [illegible]

Comparison                          L vs. M     M vs. U     L vs. U
Test for differences in means
  t-Stat                            2.17        2.24        0.60
  P                                 0.03        0.03        0.55
Test for differences in variances
  F-Stat                            2.23        [illegible] 2.65
  P                                 0.00        [illegible] 0.00

36-H Forecast
                                    Lower       Middle      Upper       Total
Samples                             160         153         109         422
Mean (n mi)                         -14.8       -5.6        -18.4       -15
Standard Deviation (n mi)           45.1        68.8        79.6        64.3

Comparison                          L vs. M     M vs. U     L vs. U
Test for differences in means
  t-Stat                            1.38        1.35        0.43
  P                                 0.17        0.18        0.67
Test for differences in variances
  F-Stat                            2.32        1.34        3.11
  P                                 0.00        0.05        0.00

48-H Forecast
                                    Lower       Middle      Upper       Total
Samples                             154         147         89          390
Mean (n mi)                         -11.7       -12.6       -21.5       -14.2
Standard Deviation (n mi)           69.8        94.3        96.2        85.8

Comparison                          L vs. M     M vs. U     L vs. U
Test for differences in means
  t-Stat                            0.09        0.69        0.84
  P                                 0.93        0.49        0.40
Test for differences in variances
  F-Stat                            1.82        [illegible] 1.90
  P                                 0.00        [illegible] 0.00

72-H Forecast
                                    Lower       Middle      Upper       Total
Samples                             152         103         57          312
Mean (n mi)                         -6          -53.7       -5.6        -19.6
Standard Deviation (n mi)           97          137         169         128

Comparison                          L vs. M     M vs. U     L vs. U
Test for differences in means
  t-Stat                            3.06        2.27        0.49
  P                                 0.00        0.03        0.63
Test for differences in variances
  F-Stat                            1.99        1.53        3.04
  P                                 0.00        0.03        0.00

96-H Forecast
                                    Lower       Middle      Upper       Total
Samples                             110         80          50          240
Mean (n mi)                         14.4        -1.8        29.3        12.1
Standard Deviation (n mi)           132         171         196         160

Comparison                          L vs. M     M vs. U     L vs. U
Test for differences in means
  t-Stat                            0.71        0.93        0.49
  P                                 0.48        0.36        0.63
Test for differences in variances
  F-Stat                            1.68        [illegible] 2.21
  P                                 0.01        [illegible] 0.00

120-H Forecast
                                    Lower       Middle      Upper       Total
Samples                             89          54          44          187
Mean (n mi)                         32.8        24.3        83.3        43.7
Standard Deviation (n mi)           177         194         269         207

Comparison                          L vs. M     M vs. U     L vs. U
Test for differences in means
  t-Stat                            0.26        1.34        1.27
  P                                 0.79        0.18        0.21
Test for differences in variances
  F-Stat                            [illegible] 1.93        2.30
  P                                 [illegible] 0.01        0.00

Legend
Color scheme (shading not reproduced): Statistically Different (at 0.05 alpha);
Within 1% C.L. of Pass/Fail
Lower (L) - Lower tercile of track forecast errors when forecast confidence was high
Middle (M) - Middle tercile of track forecast errors when forecast confidence was average
Upper (U) - Upper tercile of track forecast errors when forecast confidence was low

1. Analysis and Results

The 12-, 24-, 36-, and 48-h forecast interval tercile comparisons all had similar
results from the tests for differences in variances. That is, the lower- and middle-tercile
and the lower- and upper-tercile comparisons passed, while the middle- and upper-tercile
comparison failed except for the 36-h forecast interval that barely passed with a P-value
of 0.05 (Table 3). The standard deviations of the middle and upper terciles were nearly
identical for the 12- and 48-h forecasts, while the 24- and 36-h forecasts had a much 
smaller difference between the middle and upper terciles than they had for the lower- and 
middle-tercile comparisons. The similarities between the middle and upper terciles for 
the forecast intervals between 24 h and 48 h can be seen in Figures 26 and 27. The 
middle and upper terciles have nearly the same ranges, which are significantly different 
from the lower terciles. Along with the same ranges, the middle and upper terciles for the 
24-h histograms even have roughly the same frequency distributions. 

These cross-track error results are consistent with the reasoning discussed earlier 
regarding the short forecast intervals limiting the magnitudes of track errors. Even as 
forecast confidence decreases, the size of potential track error is limited, and this limit 
seems to be reached already when forecast confidence is average. If forecast confidence 
further decreases, the cross-track errors will not significantly increase. The difference
between the cross-track errors and the along- and total-track errors discussed above seems
to be that this line of reasoning extends out to two days instead of just one, which may be 
due to the relatively small magnitudes and ranges of cross-track errors when compared to 
along- and total-track errors. When separated by forecast confidence, the middle- and 
upper-tercile distributions have more similarities than the along- and total-track error 
middle- and upper-tercile distributions. 

For all four of these forecast intervals, the cross-track error terciles have a
negative bias (Table 3). For the 12-, 24-, and 36-h forecasts, the lower and upper tercile
negative biases are larger than for the middle tercile, while the 48-h upper tercile bias is
much larger than the biases for the middle and lower terciles. These consistent negative
biases indicate the OFCL forecasts are to the left of the storm motion.

Figure 26. Histograms of 12- (left column) and 24-h (right column) OFCL cross-track
forecast error distributions conditioned on a) high, b) average, and c) low
forecast confidence. The means are represented by the dashed lines.


The 24-, 36- and 48-h forecast intervals all had increasing variances with 
decreasing forecast confidence (Table 3). However, the 12-h forecast interval had nearly 
identical variances for the middle and upper terciles. 

Figure 27. Histograms of 36- (left column) and 48-h (right column) OFCL cross-track
forecast error distributions conditioned on a) high, b) average, and c) low 
forecast confidence. The means are represented by the dashed lines. 


The 72-, 96-, and 120-h cross-track forecasts all had increasing variances with
decreasing forecast confidence (Table 3). For all three forecast intervals, the upper
tercile error distributions were shifted to the right compared to the middle tercile
distribution (Figures 28 and 29). These distributions indicate that for the long forecast
intervals, the OFCL cross-track forecasts have a large bias to the right of the verifying TC
positions when forecast confidence is low compared to cases with average and high
forecast confidence.


The 72-h forecast interval tercile comparisons all passed the tests for differences 
in variances and were negatively biased (Table 3). However, the bias in the cross-track 
forecasts in the middle tercile was much larger than the biases for the lower and upper 
terciles. 




Figure 28. Histograms of 72- (left column) and 96-h (right column) OFCL cross-track 
forecast error distributions conditioned on a) high, b) average, and c) low 
forecast confidence. The means are represented by the dashed lines. 

Figure 29. Histograms of 120-h OFCL cross-track forecast error distributions
conditioned on a) high, b) average, and c) low forecast confidence. The
means are represented by the dashed lines.


The 96-h cross-track forecast middle- and upper-tercile comparison failed the test
for differences in variances, while the other two comparisons passed (Table 3). This
accounts for the 0.05 P-value for the middle- and upper-tercile comparison for the 96-h
total-track errors (Table 1) even though the 96-h along-track errors passed (Table 2).

For the 96- and 120-h forecasts, the cross-track forecast bias shifts from negative
(left) to positive (right), which may be due to the evolution of TCs in the Atlantic basin.
With few exceptions, a TC will eventually recurve to the north. At these long (96- and
120-h) forecast intervals, which likely involve the final stages of the TC, the movement
to the north starts to increase in speed. Since the biases for these two forecast intervals
are to the right, it may indicate that the OFCL forecasts tend to exaggerate this northward
motion.

Consistent with the along- and total-track tercile comparisons, the 120-h lower-
and middle-tercile comparison failed the test for differences in variances while the other
two comparisons passed. As previously discussed, the explanation may be due to the
large forecast interval. Small track errors are infrequent since even small errors in 
forecast motion of a TC can result in large track errors over 120 h. 

2. Summary 

The differences in variances test resulted in 16 of the 21 comparisons having
significantly different variances (Table 3). The only lower- and middle-tercile cross-track
error test failure, in the 120-h comparison, is consistent with both the total- and
along-track errors (Tables 1 and 2). The 36-h middle- and upper-tercile comparison
P-value of 0.05 is consistent with the along-track error test failure for the same
comparison. Interestingly, the total-track error passed the test for differences in
variances for this same 36-h comparison. The 96-h cross-track error test failure for the
middle- and upper-tercile comparison is consistent with the same total-track error
comparison, which resulted in a P-value of 0.05. Finally, the 12- and 48-h middle- and
upper-tercile comparison failures were unique to cross-track errors.

With the exception of the 12-h forecast interval, all other forecast intervals had
increasing cross-track variance with decreasing forecast confidence, which indicates that
for nearly all of the forecast intervals the forecast cross-track error becomes less
predictable as forecast confidence lowers. Similar to the along-track errors, if the MC
model drew from three cross-track error distributions based on forecast confidence, the
area covered by each probability interval would increase with decreasing forecast
confidence.

The only discrepancy between the cross-track and along-track comparisons was
the failures of the middle- and upper-tercile tests for differences in variances. This test
failed for four of the seven forecast intervals and resulted in a 0.05 P-value in another, while
the lower- and middle-tercile comparisons had only one failure. These failures indicate 
that despite the cross-track variance increasing with decreasing forecast confidence for all 
except the 12-h forecast interval, the new probabilistic model will benefit much more 
from the separation of high from average and low forecast confidence for cross-track
values than it would from the separation of the low from the average forecast
confidence.

Considering these tests along with the results of the along- and total-track error 
comparisons, it is clear that OFCL track forecast errors can be successfully stratified by 
forecast confidence based on the GPCE values. The new probabilistic model will most 
likely benefit from adopting this approach. 

E. OFFICIAL TOTAL-TRACK FORECAST ERRORS CONDITIONED ON GFS ENSEMBLE SPREADS

The OFCL total-track errors (equation 1) were also binned into three distributions 
based on the corresponding GFS ensemble spreads. The GFS ensemble spread is defined
as the average distance of the track positions of the ten individual members (GFS
Positive One through Five and Negative One through Five) from the mean track position.
The GFS ensemble spread values were then divided into terciles of small, average, and
large GFS ensemble spread to represent high, average, and low forecast confidence. As mentioned
earlier, fewer forecast errors are in these samples due to the limited availability of the 
GFS ensemble in the A-Decks compared to the availability of the CONU model. 
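
The spread definition given above can be sketched as follows. The great-circle
(haversine) distance, the arithmetic mean of latitudes and longitudes as the mean track
position, and all names and positions are assumptions made for illustration; the thesis
does not restate here how the distances or the mean position were computed.

```python
import math

EARTH_RADIUS_N_MI = 3440.065  # mean Earth radius expressed in nautical miles


def great_circle_n_mi(lat1, lon1, lat2, lon2):
    """Haversine distance in nautical miles between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_N_MI * math.asin(math.sqrt(a))


def ensemble_spread(member_positions):
    """Average distance (n mi) of member positions from their mean position.

    Assumes a simple arithmetic mean of latitude and longitude, which is only
    reasonable away from the poles and the 180-degree meridian.
    """
    n = len(member_positions)
    mean_lat = sum(lat for lat, _ in member_positions) / n
    mean_lon = sum(lon for _, lon in member_positions) / n
    distances = [great_circle_n_mi(lat, lon, mean_lat, mean_lon)
                 for lat, lon in member_positions]
    return sum(distances) / n


# Hypothetical positions (lat, lon in degrees) for ten ensemble members.
members = [(25.0 + 0.1 * i, -70.0 - 0.1 * i) for i in range(10)]
print(f"spread: {ensemble_spread(members):.1f} n mi")
```

A small spread then corresponds to tightly clustered member tracks and high forecast
confidence, and a large spread to low forecast confidence.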

The results of these comparisons are summarized in Table 4. In addition, tercile 
comparison histograms (Figures 30 through 33) are compared to see if they follow the 
same hypothesized progression in skewness from high forecast confidence to low 
confidence as discussed in Section B. 
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Table 4. The tercile comparison table for the OFCL total-track forecast errors 
conditioned on GFS ensemble spreads. The legend in the upper right portion 
of the table defines the color scheme and tercile definitions. 


Forecast Interval 

12-H Forecast 

OFCL Total-Track Forecast Errors 
Conditioned on GFS Ensemble Spread 

Tercile 

Lower 

Middle 

Upper 

Samples 

94 

101 

105 

Legend 


Mean (n mi) 

32.1 

38.1 

43.2 

Statistically 

Different 

(at 0.05 alpha) 

Within 1% C.L.of 

Pass/Fail 


Standard Deviation (n mi) 

25.1 

25.7 

22.9 

Total Distribution 

Samples: 300 Mean: 38 SD: 29.4 

R: 0.20 R 2 : 0.04 P: 0 

Lower(L) - Lower tercile of track forecast errors 
when forecast confidence was high 

Comparison 

L vs. M 

M vs. U 

L vs. U 

Middle(M) - Middle tercile of track forecast 
errors when forecast confidence was 
average 

Test for differences in means 



t-Stat: 3.23 

P: 0.00 

Test for differences in variances 




Upper(U) - Upper tercile of track forecast errors 
when forecast confidence was low 

Forecast Interval 

24-H Forecast 

36-H Forecast 

Tercile 

Lower 

Middle 

Upper 

Lower 

Middle 

Upper 

Samples 

91 

97 

99 

82 

89 

89 

Mean (n mi) 

50 

64.3 

68.8 

74.3 

86.7 

93.9 

Standard Deviation (n mi) 

37.3 

39.5 

31.9 

59.6 

55.2 

47.4 

Total Distribution 

Samples: 287 Mean: 61.4 SD: 37 
Coorelation: 0.15, 0.02, 0.01 

Samples: 260 Mean: 85.3 SD: 47.4 

R: 0.12 R 2 : 0.01 P: 0.04 

Comparison 

L vs. M 

M vs. U 

L vs. U 

L vs. M 

M vs. U 

L vs. U 

Test for differences in means 

t-Stat: 2.55 

P: 0.01 


t-Stat: 3.72 

P: 0.00 



t-Stat: 2.37 

P: 0.02 


Test for differences in variances 


F-Stat: 1.53 

P: 0.02 


F-Stat: 1.58 
P: 0.02 


Forecast Interval 


48-H Forecast 


72-H Forecast 


Tercile 


Lower 


Middle 


Upper 


Lower 


Middle 


Upper 


Samples 


72 


72 


72 


51 


51 


50 


Mean (n mi) 


87.5 


98.3 


127 


122 


177 


145 


Standard Deviation (n mi) 


52.9 


67.4 


77.4 


70 


122 


113 


Total Distribution 


Samples: 216 Mean: 104 SD: 68.4 
R: 0.20 R 2 : 0.03 P: 0 


Samples: 152 Mean: 148 SD: 105 
R: 0.03 R 2 : 0.01 P: 0.68 


Comparison 


L vs. M 


Test for differences in means 


Test for differences in variances 



L vs. U 


L vs. M 


L vs. U 


t-Stat: 3.54 
P: 0.00 


t-Stat: 2.82 
P: 0.01 


Forecast Interval 


F-Stat: 2.14 

P: 0.00 


F-Stat: 3.32 
P: 0.00 



Forecast Interval                    96-H Forecast                    120-H Forecast
Tercile                              Lower      Middle     Upper      Lower      Middle     Upper
Samples                              32         31         32         18         18         18
Mean (n mi)                          184        241        250        294        199        426
Standard Deviation (n mi)            96         150        196        216        98         282

Total Distribution (96-H)            Samples: 95 Mean: 306 SD: 154
                                     R: 0.15 R²: 0.01 P: 0.14
Total Distribution (120-H)           Samples: 54 Mean: 306 SD: 229
                                     R: 0.36 R²: 0.12 P: 0.01



1. Analysis and Results 

The 12-h forecast tercile comparisons only passed one test, which was the lower- 
and upper-terciles difference in means (Table 4). The lower and middle terciles are 
nearly identical in means, standard deviations, shapes, and skewness (Figure 30, 12-h 
panels a, b, and c). The upper tercile distribution differs in mean total-track errors from 
the lower tercile and is more skewed than both the lower and middle terciles. However, 
the skewness does not follow the hypothesized progression from high to low forecast 
confidence as all three terciles are skewed to the right. 

For the 24-h forecast tercile comparisons, the middle- and upper-tercile 
comparison failed the test for difference in means, while the lower- and middle-tercile 
and the lower- and upper-tercile comparisons failed the difference in variances test (Table 
4). In this forecast interval, the skewness does follow the hypothesized progression from 
high to low forecast confidence (Figure 30, 24-h panels a to c). 
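
As a rough check on how such table entries arise, the two test statistics can be recomputed from the tercile summary statistics alone. The sketch below is an illustration rather than the procedure used in this study: it assumes the unequal-variance (Welch) form of the t statistic and a larger-over-smaller variance ratio for the F statistic, and the function names are hypothetical. With the 24-h samples, means, and standard deviations listed in Table 4, it gives values close to the tabulated t-Stat of 2.55 and F-Stat of 1.53.

```python
import math

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sample t statistic from summary statistics (Welch form, an assumption)."""
    standard_error = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    return (mean_b - mean_a) / standard_error

def variance_ratio_f(sd_a, sd_b):
    """F statistic taken as the larger sample variance over the smaller one."""
    var_a, var_b = sd_a**2, sd_b**2
    return max(var_a, var_b) / min(var_a, var_b)

# 24-h OFCL total-track errors from Table 4: (samples, mean in n mi, SD in n mi)
lower = (91, 50.0, 37.3)
middle = (97, 64.3, 39.5)
upper = (99, 68.8, 31.9)

# Lower vs. middle terciles: difference in means
t_lm = welch_t(lower[1], lower[2], lower[0], middle[1], middle[2], middle[0])
# Middle vs. upper terciles: difference in variances
f_mu = variance_ratio_f(middle[2], upper[2])

print(f"L vs. M t statistic: {t_lm:.2f}")
print(f"M vs. U F statistic: {f_mu:.2f}")
```

Converting these statistics to P-values requires the t and F distributions with the appropriate degrees of freedom, which is omitted here to keep the sketch dependency-free.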

The tests for the 36-h forecast interval resulted in the lower- and upper-tercile 
comparison tests passing for difference in means and difference in variances (Table 4). 
However, the other comparisons failed both tests. Whereas the high confidence tercile is 
skewed to the right, the middle and low confidence terciles are also skewed to the right 
but to a lesser extent (Figure 31, 36-h panels a to c). The middle and upper terciles do 
not follow the hypothesized progression from high to low forecast confidence. 

The 12-, 24-, and 36-h forecast interval means increase with decreasing forecast
confidence (Figures 30 and 31), but the variances do not (Table 4). For each forecast
interval, the upper tercile variance is less than the middle- and lower-tercile variances.

The 48-h forecast interval had the best results of the seven forecast intervals. 
Only the lower- and middle-terciles comparison test for difference in means and the 
middle- and upper-terciles test for difference in variances failed. However, all of the
tercile distributions are nearly symmetric and thus do not follow the hypothesized
progression (Figure 31, 48-h panels a to c). The 48-h forecast error means and variances
do increase with decreasing forecast confidence.

Figure 30. Histograms of 12- (left column) and 24-h (right column) OFCL total-track
forecast error distributions conditioned on a) high, b) average, and c) low
forecast confidence. The means are represented by the dashed lines.


For the 72-, 96-, and 120-h forecasts, each has three failures out of the six 
comparison tests (Table 4). The error distributions become more random (Figures 32 and 
33) and have smaller samples. While the 72- and 96-h intervals have enough samples
to create reasonable histograms, the 120-h interval does not (Figures 32 and 33) since
only 18 samples per tercile are available for the 120-h tercile comparison.



Figure 31. Histograms of 36- (left column) and 48-h (right column) OFCL total-track 
forecast error distributions conditioned on a) high, b) average, and c) low 
forecast confidence. The means are represented by the dashed lines.

Figure 32. Histograms of 72- (left column) and 96-h (right column) OFCL total-track
forecast error distributions conditioned on a) high, b) average, and c) low 
forecast confidence. The means are represented by the dashed lines. 


The 72-h tercile distributions are skewed to the right for the middle and
low terciles and somewhat centered for the high tercile. The 96-h middle and upper
terciles have highly skewed distributions, while the lower tercile has a less pronounced
right-skewed distribution. The 120-h tercile distributions do not have enough samples to
determine skewness.



Figure 33. Histograms of 120-h OFCL total-track forecast error distributions
conditioned on a) high, b) average, and c) low forecast confidence. The
means are represented by the dashed lines.

In the 72-, 96-, and 120-h forecast intervals, only the 96-h interval has increasing
means and variances for decreasing forecast confidence (Table 4). The 72-h forecast 
interval low confidence tercile has a smaller mean and variance than the middle tercile, 
while in the 120-h forecast interval the middle tercile mean and variance are smaller than 
the other two. 

2. Summary 

The tests for differences in means resulted in only 8 out of 21 tercile comparisons 
having significantly different means. The tests for differences in variances were only 
slightly better with 10 of 21 comparisons having significantly different variances. The 
means and variances only increased steadily with decreasing forecast confidence for three 
of the seven forecast intervals. Additionally, only the 24-h forecast interval had the 
hypothesized progression of skewness from high to low forecast confidence. 

The correlations between the GFS ensemble spread and OFCL total-track error
were low for all forecast intervals, with adjusted R² values between 0.01 and 0.04, except
for 120 h, which had a value of 0.12. In other words, for the 12- through 96-h forecast
intervals, only 1% to 4% of the variation in the OFCL total-track errors could be explained by
the variations in the GFS ensemble spread.
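
The link between the correlation coefficient and the explained variation can be made explicit. The sketch below assumes that the tabulated value is the adjusted coefficient of determination from a simple linear regression with one predictor; the formula is the standard one, but its use here is an assumption about how the Table 4 entries were produced, and the function name is hypothetical.

```python
def adjusted_r_squared(r, n_samples, n_predictors=1):
    """Adjusted R^2 for a linear regression with n_predictors predictors."""
    r_squared = r**2
    return 1.0 - (1.0 - r_squared) * (n_samples - 1) / (n_samples - n_predictors - 1)

# 12-h interval in Table 4: R = 0.20 from 300 samples
adj_12h = adjusted_r_squared(0.20, 300)
print(f"12-h adjusted R^2: {adj_12h:.2f}")
```

A value near 0.04 means that only about 4% of the variation in the track errors is associated with variation in the ensemble spread.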

With all these factors combined, it is clear that GFS ensemble spreads are not a
good indicator of forecast confidence for the OFCL total-track forecast errors. Some of
the negative results might be explained by the smaller samples with the longer forecast
intervals. However, even with a larger number of samples in the shorter-range forecast
intervals, there were few significant differences among the tercile distributions. Based on
these results for the OFCL total-track errors, using the GFS ensemble spread to
condition OFCL along- and cross-track errors was not examined.

F. GFS ENSEMBLE MEAN TOTAL-TRACK FORECAST ERRORS
CONDITIONED ON GFS ENSEMBLE SPREADS

Since the GFS ensemble spread performed poorly as a measure of forecast
confidence for the OFCL total-track forecast errors, tests were conducted to determine
whether the GFS ensemble spread is even a good indicator of forecast confidence for
GFS ensemble mean total-track errors. The GFS ensemble mean total-track errors were
binned by GFS ensemble spread using the same method discussed in the previous
section. The results of the comparisons are summarized in Table 5, and the tercile
comparison histograms are displayed in Figures 34 through 37.
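
The binning step can be illustrated with a short sketch. It assumes a simple rank-based split of the cases into three groups of nearly equal size, ordered by the confidence predictor; the boundaries actually used in this study may have been set differently, since the tercile sample counts in Table 5 are not exactly equal. The function name and the data values are hypothetical.

```python
import statistics

def tercile_summary(spreads, errors):
    """Bin track errors by terciles of the confidence predictor (rank-based split)."""
    pairs = sorted(zip(spreads, errors))  # order cases by spread, smallest first
    n = len(pairs)
    cuts = [0, n // 3, 2 * n // 3, n]
    summary = {}
    for name, start, stop in zip(("lower", "middle", "upper"), cuts, cuts[1:]):
        group = [error for _, error in pairs[start:stop]]
        summary[name] = {
            "samples": len(group),
            "mean": statistics.mean(group),
            "sd": statistics.stdev(group),
        }
    return summary

# Hypothetical ensemble spreads and track errors (n mi), for illustration only
spreads = [20, 35, 50, 65, 80, 95, 110, 125, 140]
errors = [30, 45, 40, 60, 75, 55, 90, 120, 70]

result = tercile_summary(spreads, errors)
for name, group_stats in result.items():
    print(name, group_stats["samples"], round(group_stats["mean"], 1), round(group_stats["sd"], 1))
```

The lower tercile holds the cases with the smallest spreads (high forecast confidence), and the upper tercile holds the cases with the largest spreads (low forecast confidence).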


Table 5. The tercile comparison table for the GFS ensemble mean total-track forecast 
errors conditioned on GFS ensemble spreads. The legend in the upper right 
portion of the table defines the color scheme and tercile definitions. 


GFS Ensemble Mean Total-Track Forecast Errors Conditioned on GFS Ensemble Spread

Legend
Statistically Different (at 0.05 alpha)
Within 1% C.L. of Pass/Fail
Lower (L) - Lower tercile of track forecast errors when forecast confidence was high
Middle (M) - Middle tercile of track forecast errors when forecast confidence was average
Upper (U) - Upper tercile of track forecast errors when forecast confidence was low

Forecast Interval                    12-H Forecast
Tercile                              Lower      Middle     Upper
Samples                              94         101        105
Mean (n mi)                          34.1       45.1       46.6
Standard Deviation (n mi)            24.6       40.9       25.2

Total Distribution                   Samples: 300 Mean: 42.2 SD: 31.7
                                     R: 0.20 R²: 0.04 P: 0

Comparison                           L vs. M                  M vs. U                  L vs. U
Test for differences in means        t-Stat: 2.44, P: 0.02    --                       t-Stat: 3.73, P: 0.00
Test for differences in variances    F-Stat: 2.77, P: 0.00    F-Stat: 2.63, P: 0.00    --

Forecast Interval                    24-H Forecast                    36-H Forecast
Tercile                              Lower      Middle     Upper      Lower      Middle     Upper
Samples                              100        104        105        87         87         90
Mean (n mi)                          54.7       57.3       65.6       68.4       77.6       88
Standard Deviation (n mi)            33.9       35.8       36.7       43.6       48.5       44.5

Total Distribution (24-H)            Samples: 309 Mean: 59.3 SD: 35.7
                                     R: 0.16 R²: 0.02 P: 0
Total Distribution (36-H)            Samples: 264 Mean: 78.1 SD: 46.1
                                     R: 0.15 R²: 0.02 P: 0.01

Comparison (24-H)                    L vs. M                  M vs. U                  L vs. U
Test for differences in means        --                       --                       t-Stat: 2.19, P: 0.01
Test for differences in variances    --                       --                       --

Comparison (36-H)                    L vs. M                  M vs. U                  L vs. U
Test for differences in means        --                       --                       t-Stat: 2.96, P: 0.00
Test for differences in variances    --                       --                       --
Forecast Interval                    48-H Forecast                    72-H Forecast
Tercile                              Lower      Middle     Upper      Lower      Middle     Upper
Samples                              74         69         75         50         51         52
Mean (n mi)                          80.5       85.3       111        107        134        146
Standard Deviation (n mi)            51.9       53.5       62.5       53.4       78.6       90.6

Total Distribution (48-H)            Samples: 218 Mean: 92.6 SD: 57.6
                                     R: 0.22 R²: 0.04 P: 0
Total Distribution (72-H)            Samples: 153 Mean: 129 SD: 77.3
                                     R: 0.20 R²: 0.01 P: 0.01



Forecast Interval                    96-H Forecast                    120-H Forecast
Tercile                              Lower      Middle     Upper      Lower      Middle     Upper
Samples                              33         32         33         18         18         19
Mean (n mi)                          158        203        248        180        245        308
Standard Deviation (n mi)            71.4       106        160        141        109        167

Total Distribution (96-H)            Samples: 98 Mean: 203 SD: 123
                                     R: 0.45 R²: 0.20 P: 0
Total Distribution (120-H)           Samples: 55 Mean: 245 SD: 149
                                     R: 0.41 R²: 0.17 P: 0

1. Analysis and Results

The GFS ensemble spreads provided a slightly better measure of forecast 
confidence for the ensemble mean total-track forecast errors than they did for the OFCL 
errors. The tercile histograms in Figures 34 through 37 indicate that for every forecast 
interval the total-track error means increase with decreasing forecast confidence. 



Figure 34. Histograms of 12- (left column) and 24-h (right column) GFS ensemble 
mean total-track forecast error distributions conditioned on a) high, b) 
average, and c) low forecast confidence. The means are represented by the 
dashed lines.

Figure 35. Histograms of 36- (left column) and 48-h (right column) GFS ensemble
mean total-track forecast error distributions conditioned on a) high, b) 
average, and c) low forecast confidence. The means are represented by the 
dashed lines. 


However, the results of the statistical tests were not any better (Table 5) than the 
results from using the GFS ensemble spreads as a measure of forecast confidence to 
condition the OFCL total-track forecast errors (Table 4). The tests for differences in 
means resulted in only 11 of the 21 comparisons having significantly different means,
while the tests for differences in variances had only 8 out of 21 comparisons having
significantly different variances. In addition, the variances do not increase steadily with 
decreasing forecast confidence in three of the seven ensemble mean track error 
distributions. As in the previous section, only the 24-h forecast interval had the 
hypothesized progression of skewness from high to low forecast confidence (Figure 34, 
24-h panels a to c). 



Figure 36. Histograms of 72- (left column) and 96-h (right column) GFS ensemble 
mean total-track forecast error distributions conditioned on a) high, b) 
average, and c) low forecast confidence. The means are represented by the 
dashed lines.

Figure 37. Histograms of 120-h GFS ensemble mean total-track forecast error
distributions conditioned on a) high, b) average, and c) low forecast
confidence. The means are represented by the dashed lines.

2. Summary

There was very little indication that GFS ensemble spread is a good measure of 
forecast confidence for the GFS ensemble mean total-track forecast errors. Rather, the 
conclusion is that using the spread of a consensus via GPCE values instead of a single-
model ensemble spread is more reliable in determining overall forecast confidence of TC
track forecasts.

G. DIFFERENT INDICATORS OF FORECAST CONFIDENCE VS TRACK
ERRORS FOR HURRICANE WILMA

To further illustrate the differences between using GPCE values rather than GFS 
ensemble spreads as a measure of forecast confidence, the track forecasts for Hurricane 
Wilma (October 2005) were examined. The 72-h forecast interval was chosen since it is 
near the middle of the seven forecast periods and is representative of the differences at 
other forecast intervals. 

The GPCE values and OFCL total-track forecast errors (Figure 38), GFS 
ensemble spreads and OFCL total-track forecast errors (Figure 39), and the GFS 
ensemble spreads and the GFS ensemble mean total-track errors (Figure 40) were 
compared to illustrate how changes in the measures of forecast confidence varied with 
track errors. The highest correlation of 0.51 was between the GPCE values and the 
OFCL total-track forecast errors. The next highest correlation was 0.38 for the GFS 
ensemble spreads and the GFS ensemble mean total-track forecast errors. The lowest 
correlation of 0.25 was between the GFS ensemble spreads and the OFCL total-track 
forecast errors. These results are consistent with the earlier seasonal summaries, and 
illustrate again that the GPCE values provide a better measure of OFCL forecast 
confidence than the GFS ensemble spreads. 
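
For reference, the correlation quoted for each pairing is the kind of quantity computed in the sketch below, which assumes the ordinary Pearson correlation coefficient between the confidence indicator and the track error over the advisories. The function name is hypothetical, and the two short series are made-up values for illustration, not the Hurricane Wilma data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical confidence indicator and track error per advisory (n mi)
indicator = [100, 120, 90, 150, 170]
track_error = [110, 100, 80, 160, 150]

r_example = pearson_r(indicator, track_error)
print(f"correlation: {r_example:.2f}")
```

A coefficient near 1 would mean that the indicator rises and falls in step with the track error, while a value near 0 would mean that the indicator carries little information about the error.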

In the two comparisons involving OFCL total-track forecast errors for Hurricane 
Wilma in Figures 38 and 39, the forecast confidence indicators started at a higher value 
and were poorly correlated with the track errors for the first 18 to 24 hours, perhaps 
because the storm was in its early stages when the forecast models normally have less 
skill. Both confidence indicators seem to be out of phase with the track error.

[Plot: 72-H Forecast Confidence vs. Track Error comparison for Hurricane Wilma (Oct 2005). Series: OFCL Total-Track Error, GPCE Value. Abscissa: Day and Time (ddhh).]

Figure 38. The 72-h OFCL total-track forecast errors and GPCE values (ordinate) at the 
times for each advisory (abscissa) issued for Hurricane Wilma (2005). 



Figure 39. The 72-h OFCL total-track forecast errors and GFS ensemble spreads 
(ordinate) at the times for each advisory (abscissa) issued for Hurricane 
Wilma (2005).

Figure 40. The 72-h GFS ensemble mean total-track forecast errors and GFS ensemble
spreads (ordinate) at the times for each advisory (abscissa) issued for 
Hurricane Wilma (2005). 

After the poor start, the GPCE values have a reasonably good correlation with the 
OFCL total-track forecast error (Figure 38). For example, a large decrease in track 
forecast error coincided with a similar decrease in GPCE value for the 0600 UTC 19 
October forecast and the 0000 UTC forecast on 21 October. Also, the GPCE values were
a reasonable indicator of the increasing OFCL track forecast errors from the 0600 UTC
21 October forecast through the last forecast at 1800 UTC 22 October. However, the large
decrease in track error at 1200 UTC 22 October was not predicted by the GPCE value.
Such a large departure from the general trend in errors may simply reflect an
excellent OFCL forecast that departed from the consensus track guidance.

Although the GFS ensemble spreads generally increase with OFCL track forecast
errors (Figure 39), the correlation is not as good as for the GPCE values. In some cases, the
GFS ensemble spread is out of phase with the total-track forecast error. Some examples
are the 0600 UTC 19 October forecast and the 0000 UTC 21 October forecast in which
the OFCL error and GPCE value both decreased, but the GFS ensemble spread had
decreased 6 hours earlier. The GFS ensemble spreads also have more variability than the
GPCE values. After 1200 UTC 19 October, the largest variability in the GPCE values
relative to the general trend is 25 n mi on 22 October. By contrast, the GFS ensemble
spreads routinely vary by 75 n mi from one forecast period to the next, including a 90 n
mi variation on 22 October.

The GFS ensemble mean track forecast errors versus GFS ensemble spreads often 
have large variability from one forecast period to the next throughout the period of 
Hurricane Wilma (Figure 40). At times, the GFS ensemble mean errors and spreads are 
in phase, and at other times they are not. This lack of correlation again illustrates that
the GFS ensemble spreads are not only a poor indicator of OFCL track forecast confidence,
but also a poor indicator of the GFS ensemble mean track forecast error.

V. CONCLUSIONS AND RECOMMENDATIONS


A. CONCLUSIONS 

This study was a first step in an investigation of whether a new version of the MC 
model for occurrences of tropical cyclone-induced wind would be improved by 
conditioning the track error distribution that is used to calculate probabilities to account 
for the track forecast difficulty. If it were possible to use different track error
distributions for different levels of forecast confidence, the wind probability output may
be more accurate. For example, it may be possible to reduce the massive costs of an overly
cautious evacuation when the track forecast confidence is high, or even to save lives
by expanding the evacuation zone when the track forecast confidence is low.

In this first step, two predictors were tested as measures of track forecast 
confidence. One predictor was the GPCE value, which is calculated from the spread of 
the individual member tracks in the CONU model. The other predictor was the GFS 
ensemble model track spread, which is calculated from the average distance of the 
individual ensemble members from the ensemble mean. The GPCE values were used as 
predictors of the magnitude of the OFCL total-track forecast errors, cross-track forecast 
errors, and along-track forecast errors. Similarly, the GFS ensemble spreads were used as 
an indicator of forecast confidence in the OFCL total-track forecast error and in the GFS 
ensemble mean total-track forecast error. 
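
The ensemble spread definition above (the average distance of the individual members from the ensemble mean) can be sketched as follows. This is an illustration under stated assumptions: the ensemble-mean position is taken as the arithmetic mean of the member latitudes and longitudes, distances are great-circle distances from the haversine formula, and the member positions are hypothetical. The function names are not taken from any operational code.

```python
import math

EARTH_RADIUS_NMI = 3440.065  # mean Earth radius in nautical miles

def great_circle_nmi(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in nautical miles (inputs in degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_NMI * math.asin(math.sqrt(a))

def ensemble_spread(positions):
    """Average distance (n mi) of member positions from the ensemble-mean position."""
    mean_lat = sum(lat for lat, _ in positions) / len(positions)
    mean_lon = sum(lon for _, lon in positions) / len(positions)
    distances = [great_circle_nmi(mean_lat, mean_lon, lat, lon) for lat, lon in positions]
    return sum(distances) / len(distances)

# Hypothetical member forecast positions (latitude, longitude in degrees)
members = [(25.0, -80.0), (25.5, -79.0), (24.5, -81.0), (26.0, -80.5)]
spread = ensemble_spread(members)
print(f"ensemble spread: {spread:.1f} n mi")
```

Averaging latitudes and longitudes directly is adequate only for clusters of positions away from the poles and the date line, which is a simplification made here for brevity.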

Statistical techniques were used to determine if the means and the variances of
these track forecast errors were significantly different when the terciles of GPCE values
and GFS ensemble spreads were used as predictors of forecast confidence. That is,
would the track forecast errors separate into three significantly different
distributions?

1. Official Total-, Along-, and Cross-Track Forecast Errors Conditioned 
on GPCE Values 

The OFCL total-track forecast errors were examined first to provide a possible 
basis for investigating the along- and cross-track forecast errors. The tests for differences 
in means resulted in 19 of the 21 tercile comparisons having significantly different 
means. Similarly, the results of the tests for differences in variances showed that 18 of
the 21 tercile comparisons had significantly different variances (Table 1). Additionally,
the OFCL track error means and variances increased as forecast confidence decreased for 
all forecast intervals. For all but the 12-h forecast interval, the tercile distributions 
followed the hypothesized progression of increased skewness from high forecast 
confidence to low. The correlations of the GPCE values and track errors at the various 
forecast intervals ranged from 0.29 to 0.41, which were consistent with expectations. 
These results indicate that GPCE values are a good indicator of forecast confidence. This 
test with total-track errors warranted a closer look at the along- and cross-track forecast 
error components of the OFCL forecast errors from which the MC model draws its 
historic track errors. 

The OFCL along-track forecast errors were tested to see if they produced similar 
results to the OFCL total-track forecast errors. The variance tests resulted in 18 of the 21 
tercile comparisons having significantly different variances (Table 2). Two of the 
failures were consistent with the total-track test failures for the 72-h middle- and upper- 
terciles and 120-h lower- and middle-terciles comparisons. Although the failure of the 
36-h middle- and upper-terciles along-track error comparison did not have a 
corresponding failure with total-track errors, that comparison did have a P-value of 0.03. 
Conversely, the 24-h along-track error lower and middle terciles comparison passed 
while the corresponding total-track error terciles comparison failed. As for the OFCL 
total-track forecast errors, the along-track forecast errors increased in variance as forecast 
confidence decreased for all forecast periods. Taking these factors into account along 
with the results of the total-track error comparisons, it is clear that OFCL along-track 
forecast errors can be successfully stratified by forecast confidence based on the GPCE 
values. 

Next, the OFCL cross-track forecast errors were tested to see if they were
consistent with the results of the total- and along-track error tercile comparisons. The
differences in variances test resulted in 16 of the 21 comparisons having significantly
different variances (Table 3). Several of the test failures for the cross-track forecast errors
matched those for the along- and total-track errors, e.g., the 120-h
lower- and middle-terciles comparison (Tables 1 and 2). The 36-h middle- and upper-tercile
comparison P-value of 0.5 is consistent with the along-track error test failure for the
same comparison. The 96-h cross-track error test failure for the middle- and upper-tercile
comparison is consistent with the same total-track error comparison, which resulted in a
P-value of 0.05. Finally, the 12- and 48-h tercile comparison test failures were unique to
the cross-track errors. Cross-track errors for the OFCL are usually smaller in magnitude than
the along-track errors. Therefore, the range of the cross-track distributions was smaller
than that of the along-track distributions, which means there is less room for variability. This
difference in variability resulted in some of the tests for differences in variances for the
along-track errors having a slightly higher P-value than the cross-track results. With the
exception of the 12-h forecast interval, all other forecast intervals had increasing variance
with decreasing forecast confidence, which indicates that for the majority of the forecast
intervals, forecast track error becomes less predictable as forecast confidence decreases.
As for the along-track errors, if the MC model drew from three cross-track
error distributions based on forecast confidence, the area covered by each probability
interval should increase with decreasing forecast confidence.

These tests using the GPCE values as a measure of forecast confidence to 
condition OFCL along- and cross-track forecast errors indicate that using this method 
will successfully stratify the errors into significantly different distributions. The MC 
model will most likely benefit from adopting this approach instead of drawing from just 
one static distribution. 

2. Testing the Effectiveness of GFS Ensemble Spread as an Indicator of 
Forecast Confidence 

Using GFS ensemble spreads as a measure of forecast confidence to condition 
OFCL total-track forecast errors was also tested to see if the results were any better than
using GPCE values. These GFS ensemble-based tests for differences in means resulted 
in only 8 of 21 comparisons having significantly different means, while the tests for 
differences in variances had only 10 of 21 comparisons having significantly different 
variances. The means and variances only increased steadily with decreasing forecast 
confidence for three of the seven forecast intervals. Additionally, only the 24-h forecast 
interval had the hypothesized progression of skewness from high to low forecast 
confidence. Furthermore, the correlations between the GFS ensemble spread and OFCL
total-track error were small for all forecast intervals, with adjusted R² values between 0.01
and 0.04, except for the 120-h interval, which had a value of 0.12. These tests indicate that GFS
ensemble spreads are not a good indicator of forecast confidence for OFCL total-track
forecast errors. Therefore, using the GFS ensemble spreads to condition OFCL along-
and cross-track errors was not examined.

Whether the GFS ensemble spread provided an accurate indicator of forecast
confidence for the GFS ensemble mean total-track errors was also investigated. The
statistical tests indicated no significant improvement. The tests for differences in means resulted in only
11 of 21 comparisons having significantly different means, while the tests for differences 
in variances had only 8 of 21 comparisons having significantly different variances. In 
addition, the ensemble track error variances did not increase steadily with decreasing 
forecast confidence in three of the seven distributions. The only consistent positive result 
was that the means for the tercile distributions increased with decreasing forecast 
confidence for every forecast interval. 

It is concluded from these two comparisons using the GFS ensemble spread as a 
measure of forecast confidence that the MC model would not benefit from using the GFS 
ensemble spread to stratify OFCL along- and cross-track errors. Rather, it is concluded 
that using the spread of a consensus instead of a single-model ensemble spread is a more
reliable method for specifying overall forecast confidence of TC track forecasts.

3. Summary 

The key result of this thesis is that changing the MC model of the new NHC 
probabilistic product to draw from one of three historic OFCL along- and cross-track 
forecast error distributions conditioned on forecast confidence derived from the GPCE 
values will most likely yield improved results. However, using GFS ensemble spread as 
a measure of forecast confidence will not improve the model, and may actually degrade 
it. 

Such a change in the new probabilistic model is expected to improve the accuracy 
of its probabilistic wind output. This change could reduce the massive
costs of overly cautious evacuations when forecast confidence is high, or even save lives
by expanding the evacuation zone when forecast confidence is low. In the end, the
changes will result in improved TC forecasting support for both the military and civilian
sectors.

B. RECOMMENDATIONS 

The next step would be to change the MC model code to draw from three along- 
track and cross-track error distributions based on forecast confidence (GPCE value) 
instead of just one error distribution for all track forecast situations. The model should 
then be tested to see if the probabilistic wind distribution accuracy is significantly 
improved. If so, a permanent change in the model should be made.
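
A minimal sketch of the recommended change is given below. It is not the MC model code, which is not reproduced in this thesis; the tercile boundaries, the error pools, and all function names are hypothetical and serve only to show the idea of selecting one of three historical error distributions from the GPCE value before drawing a realization.

```python
import random

def confidence_level(gpce_value, lower_cut, upper_cut):
    """Map a GPCE value (n mi) to a forecast confidence level using tercile cuts."""
    if gpce_value < lower_cut:
        return "high"      # small predicted error: high forecast confidence
    if gpce_value < upper_cut:
        return "average"
    return "low"           # large predicted error: low forecast confidence

def draw_track_error(gpce_value, cuts, error_pools, rng):
    """Draw one (along-track, cross-track) error pair from the matching pool."""
    level = confidence_level(gpce_value, *cuts)
    along, cross = rng.choice(error_pools[level])
    return level, along, cross

# Hypothetical tercile boundaries and historical error pairs (n mi)
cuts = (60.0, 100.0)
error_pools = {
    "high": [(-20.0, 10.0), (15.0, -5.0), (5.0, 12.0)],
    "average": [(-45.0, 30.0), (40.0, -25.0), (10.0, 35.0)],
    "low": [(-110.0, 70.0), (95.0, -80.0), (60.0, 100.0)],
}

rng = random.Random(0)  # seeded generator so the sketch is repeatable
level, along, cross = draw_track_error(40.0, cuts, error_pools, rng)
print(level, along, cross)
```

Repeating the draw many times per forecast would produce realizations whose scatter widens as forecast confidence decreases, which is the behavior the conditioned distributions are intended to supply.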

Future research should concentrate on other factors that may influence historic 
track error distributions. Some of those factors may be the time of year, steering flow 
characteristics, storm intensity, weather regime of the eastern U.S., and TC origin.